Abstract


Multiple applications executing concurrently on a multicore system interfere with each other at
different shared resources such as main memory and shared caches. Such inter-application
interference, if uncontrolled, results in high system performance degradation and unpredictable
application slowdowns. While previous work has proposed application-aware memory scheduling as
a solution to mitigate inter-application interference and improve system performance, previously
proposed memory scheduling techniques incur high hardware complexity and unfairly slow down
some applications. Furthermore, previously proposed memory-interference mitigation techniques
are not designed to precisely control application performance.

This dissertation seeks to achieve high and controllable performance in multicore systems by
mitigating and quantifying the impact of shared resource interference. First, towards mitigating
memory interference and achieving high performance, we propose the Blacklisting memory
scheduler. We observe that ranking applications individually with a total order based on memory
access characteristics, as previous schedulers do, leads to high hardware cost, while also causing
unfair application slowdowns. The Blacklisting memory scheduler overcomes these shortcomings
based on two key observations. First, we observe that, to mitigate interference, it is sufficient
to separate applications into only two groups, one containing applications that are vulnerable to
interference and another containing applications that cause interference, instead of ranking
individual applications with a total order. The vulnerable-to-interference group is prioritized
over the interference-causing group. Second, we show that this grouping can be efficiently
performed by simply counting the number of consecutive requests served from each application: an
application that has a large number of consecutive requests served is dynamically classified as
interference-causing. The Blacklisting memory scheduler, designed based on these insights,
achieves high system performance and fairness, while incurring significantly lower complexity
than state-of-the-art application-aware schedulers.

Next, towards quantifying the impact of memory interference and achieving controllable
performance in the presence of memory bandwidth interference, we propose the Memory Interference
induced Slowdown Estimation (MISE) model. The MISE model estimates application slowdowns
due to memory interference based on two observations. First, the performance of a memory-bound
application is roughly proportional to the rate at which its memory requests are served, suggesting
that request-service-rate can be used as a proxy for performance. Second, when an application’s
requests are prioritized over all other applications’ requests, the application experiences very
little interference from other applications. This provides a means for estimating the uninterfered
request-service-rate of an application while it is run alongside other applications. Using the above
observations, MISE estimates the slowdown of an application as the ratio of its uninterfered to
interfered request service rates. We propose simple changes to the above model to estimate the
slowdown of non-memory-bound applications. We propose and demonstrate two use cases that
can leverage MISE to provide soft performance guarantees and high overall performance/fairness.

Finally, we seek to quantify the impact of shared cache interference on application slowdowns,
in addition to memory bandwidth interference. Towards this end, we propose the Application
Slowdown Model (ASM). ASM builds on MISE and observes that the performance of an
application is strongly correlated with the rate at which the application accesses the shared cache.
This is a more general observation than that of MISE and holds for all applications, thereby
enabling the estimation of slowdown for any application as the ratio of the uninterfered to the
interfered shared cache access rate. This reduces the problem of estimating slowdown to estimating
the shared cache access rate of the application had it been run alone on the system. ASM
periodically estimates each application’s cache-access-rate-alone by minimizing interference at the
main memory and quantifying interference at the shared cache. We propose and demonstrate several
use cases of ASM that leverage it to provide soft performance guarantees and improve performance
and fairness.
Chapter 1


Introduction 

1.1 Problem 

Applications executing concurrently on a multicore chip contend with each other to access shared
resources such as main memory and shared caches. The main memory has limited bandwidth,
driven by constraints on pin counts. If the available shared cache capacity and memory bandwidth
are not managed well, different applications can harmfully interfere with each other, resulting
in significant degradation in both system performance and individual application performance.
Furthermore, the slowdown experienced by an application due to inter-application interference at
these shared resources depends on the other concurrently running applications and the available
memory bandwidth and shared cache capacity. Hence, different applications experience different
and unpredictable slowdowns.

Figure 1.1 shows the slowdown of leslie3d, an application from the SPEC CPU 2006 suite, when it
is run with two different applications, gcc and mcf, on a simulated two-core system where the
cores share a main memory channel. As can be seen, leslie3d and mcf slow down significantly due
to shared resource interference (gcc does not slow down since it is largely compute-bound and does
not access the main memory much). Furthermore, leslie3d slows down by 1.9x when it is run with
gcc, an application that rarely accesses the main memory. However, leslie3d slows down by 5.4x
when it is run with mcf, which frequently accesses the main memory. This is one representative
example demonstrating that an application experiences different slowdowns when run with
applications that have different shared resource access characteristics. We observe this behavior
across a variety of applications, as also observed by previous works [l82[8^|871|89l, resulting in
high and unpredictable application slowdowns.



Figure 1.1: leslie3d’s slowdown compared to when run alone. (a) leslie3d co-running with gcc.
(b) leslie3d co-running with mcf.


Inter-application interference and the resultant high application slowdowns are a major
problem in most multicore systems, where multiple applications share resources. Furthermore, the
unpredictable nature of the slowdowns is particularly undesirable in several scenarios where some
applications are critical and need to meet requirements on their performance. For instance, in a
data center/virtualized environment where multiple applications, each potentially from a different
user, are consolidated on the same machine, it is common for each user to require a certain
guaranteed performance for their application. Another example is in mobile systems where interactive
and non-interactive jobs share resources and the interactive jobs need to meet deadlines/frame rate
requirements. Our main research objective is to mitigate and quantify shared resource
interference, towards the end of achieving high and controllable application performance, through
simple and implementable slowdown estimation and shared resource management techniques.
1.2 Our Solutions


1.2.1 The Blacklisting Memory Scheduler 

Towards achieving our goal of high system performance and fairness, we propose the Blacklisting
memory scheduler, a simple memory scheduler design that is able to achieve high performance
and fairness at low cost by mitigating interference at the main memory. Although the problem of
memory interference mitigation has been much explored, with memory request scheduling being
the prevalent solution direction, we observe that previously proposed memory schedulers are both
complex and unfair. The main source of this complexity and unfairness is the notion of ranking
applications individually, with a total rank order, based on applications’ memory access
characteristics. Computing and enforcing ranks incurs high hardware complexity, both in terms of
logic and storage overhead. As a result, the critical path latency and chip area of ranking-based
application-aware memory schedulers are significantly higher compared to application-unaware
schedulers. For example, Thread Cluster Memory Scheduler (TCM) [|M]|, a state-of-the-art
application-aware scheduler, is 8x slower and 1.8x larger than a commonly employed
application-unaware scheduler, FRFCFS [97]. Furthermore, when a total order based ranking is
employed, applications that are at the bottom of the ranking stack get heavily deprioritized and
unfairly slowed down. This greatly degrades system fairness.

In order to overcome these shortcomings of previous ranking-based schedulers, we propose
the Blacklisting memory scheduler (BLISS) III081 1 1 0911 based on two new observations. First, in
contrast to forming a total rank order of all applications (as done in prior works), we find that,
to mitigate interference, it is sufficient to i) separate applications into only two groups, one group
containing applications that are vulnerable to interference and another containing applications that
cause interference, and ii) prioritize the requests of the vulnerable-to-interference group over the
requests of the interference-causing group. Second, we observe that applications can be efficiently
classified as either vulnerable-to-interference or interference-causing by simply counting the
number of consecutive requests served from an application in a short time interval.

BLISS achieves better system performance and fairness than the best-performing previous
schedulers, while incurring significantly lower complexity. However, BLISS does not tackle the
problem of unpredictable application slowdowns.
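The classification idea described above can be illustrated with a small Python sketch. It is a simplified software illustration, not the hardware design evaluated in this dissertation: the class name `BlacklistTracker`, the streak threshold, and the periodic clearing interval are all assumptions introduced here for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class BlacklistTracker:
    """Hypothetical tracker that groups applications by consecutive requests served."""
    threshold: int = 4            # assumed streak length that triggers blacklisting
    clear_interval: int = 10000   # assumed number of served requests between resets
    last_app: Optional[int] = None
    streak: int = 0
    served: int = 0
    blacklisted: Set[int] = field(default_factory=set)

    def on_request_served(self, app_id: int) -> None:
        # Count how many requests in a row were served from the same application.
        if app_id == self.last_app:
            self.streak += 1
        else:
            self.last_app = app_id
            self.streak = 1
        # A long streak classifies the application as interference-causing.
        if self.streak >= self.threshold:
            self.blacklisted.add(app_id)
        # Periodically forget the classification so that it stays dynamic.
        self.served += 1
        if self.served % self.clear_interval == 0:
            self.blacklisted.clear()

    def priority_key(self, app_id: int) -> int:
        # Lower key means higher priority: the vulnerable group goes first.
        return 1 if app_id in self.blacklisted else 0
```

A scheduler built on such a tracker would only need to compare a single bit per application (blacklisted or not) when choosing among requests, rather than maintaining and comparing a full rank order.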

1.2.2 The Memory Interference induced Slowdown Estimation (MISE) Model 

Towards tackling the problem of unpredictable application slowdowns, we first propose to
estimate/quantify and control application slowdowns in the presence of interference at the main
memory. First, we estimate application slowdowns using the Memory Interference induced Slowdown
Estimation (MISE) model UlllH . The MISE model accurately estimates application slowdowns
based on two key observations. First, the performance of a memory-bound application is roughly
proportional to the rate at which its memory requests are served. This observation suggests that
we can use request-service-rate as a proxy for performance, for memory-bound applications. As
a result, the slowdown of such an application can be computed as the ratio of the
request-service-rate when the application is run alone on a system to that when it is run alongside
other interfering applications. Second, the alone-request-service-rate of an application can be
estimated by giving the application’s requests the highest priority in accessing memory. Giving an
application’s requests the highest priority in accessing memory results in very little interference
from other applications’ requests. As a result, most of the application’s requests are served as
though the application has all the memory bandwidth for itself, allowing the system to gather a
good estimate for the alone-request-service-rate of the application. We adapt these observations
and extend the model to also estimate slowdowns of applications that are not memory-bound.
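As a minimal sketch of the ratio described above for memory-bound applications, the following Python functions compute a slowdown from two request service rates. The function names and the idea of passing raw request and cycle counts are assumptions made for illustration; the extension to non-memory-bound applications is not shown.

```python
def request_service_rate(requests_served: int, cycles: int) -> float:
    """Requests served per cycle over some measurement window."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return requests_served / cycles


def mise_slowdown(alone_rsr: float, shared_rsr: float) -> float:
    """Slowdown of a memory-bound application as the ratio of its
    alone (uninterfered) to shared (interfered) request service rate."""
    if alone_rsr <= 0 or shared_rsr <= 0:
        raise ValueError("request service rates must be positive")
    return alone_rsr / shared_rsr


# Hypothetical numbers: the alone rate is measured only over the cycles in which
# the application's requests had the highest priority at the memory controller.
alone = request_service_rate(requests_served=600, cycles=10_000)
shared = request_service_rate(requests_served=3_000, cycles=100_000)
estimated_slowdown = mise_slowdown(alone, shared)
```

With these made-up counts, the application is served twice as fast when prioritized as it is on average, so the sketch estimates a slowdown of about 2.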

Accurate slowdown estimates from the MISE model can enable several mechanisms to achieve 
both high and controllable performance. We build two such mechanisms on top of our proposed 
model to demonstrate its effectiveness. 
1.2.3 The Application Slowdown Model (ASM)


The MISE model estimates slowdowns due to interference at the main memory. However, it does
not take into account interference at the shared caches and assumes caches are private. The
Application Slowdown Model (ASM) [UlOH estimates slowdowns due to both shared cache and main
memory interference. ASM does so by exploiting the observation that the performance of each
application is roughly proportional to the rate at which it accesses the shared cache. This
observation builds on MISE’s observation on the correlation between memory request service rate and
performance. However, it is more general and applies to all applications, unlike MISE’s
observation, which applies only to memory-bound applications. ASM estimates alone-cache-access-rate
in two steps. First, ASM minimizes interference for an application at the main memory by giving the
application’s requests the highest priority at the memory controller, similar to MISE. Doing so also
enables ASM to get an accurate estimate of the average cache miss service time of the application
had it been run alone (to be used in the next step). Second, ASM quantifies the effect of
interference at the shared cache by using an auxiliary tag store to determine the number of shared
cache misses that would have been hits if the application did not share the cache with other
applications (contention misses). This aggregate contention miss count is used along with the
average miss service time (from the previous step) to estimate the actual time it would have taken
to serve the application’s requests had it been run alone.
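The two steps above can be sketched in Python as follows. This is an illustrative approximation under stated assumptions, not the exact formulation of the model: the function and parameter names are hypothetical, and charging each contention miss the difference between the average miss service time and an assumed average hit time is a simplification introduced here.

```python
def asm_alone_cache_access_rate(
    accesses: int,
    high_priority_cycles: int,
    contention_misses: int,
    avg_miss_service_time: float,
    avg_hit_time: float = 0.0,  # assumption: cost of an access had it hit
) -> float:
    """Estimate the shared cache access rate the application would have alone.

    Inputs are assumed to be gathered while the application's requests have the
    highest priority at the memory controller (step one in the text).
    """
    # Step two: each contention miss would have been a hit without sharing,
    # so remove the extra cycles it cost from the measured time.
    excess_cycles = contention_misses * (avg_miss_service_time - avg_hit_time)
    alone_cycles = high_priority_cycles - excess_cycles
    if alone_cycles <= 0:
        raise ValueError("estimated alone time must be positive")
    return accesses / alone_cycles


def asm_slowdown(alone_car: float, shared_car: float) -> float:
    """Slowdown as the ratio of alone to shared cache access rate."""
    if shared_car <= 0:
        raise ValueError("shared cache access rate must be positive")
    return alone_car / shared_car
```

For example, with 1000 accesses in 50,000 prioritized cycles, 100 contention misses, a 200-cycle miss service time and a 20-cycle hit time, the estimated alone time is 32,000 cycles; if the same application performs 1000 accesses every 64,000 cycles when sharing, the sketch reports a slowdown of 2.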

We present and evaluate several mechanisms that can leverage ASM’s slowdown estimates
towards achieving different goals such as high performance, fairness, bounded application
slowdowns and fair billing, thereby demonstrating the model’s effectiveness.


1.3 Thesis Statement 

High and controllable performance can be achieved in multicore systems through simple and
implementable mechanisms to mitigate and quantify shared resource interference.
1.4 Contributions


This dissertation makes the following major contributions:

• This dissertation makes the observation that it is not necessary to rank individual
applications with a total rank order, like most previous ranking-based application-aware memory
schedulers do, in order to mitigate interference between applications. This observation
enables the design of the Blacklisting memory scheduler, a low-complexity memory scheduling
technique that is able to achieve high performance and fairness, by simply categorizing
applications as interference-causing or vulnerable.

• This dissertation makes the observation that an application’s performance is roughly
proportional to the rate at which requests are generated to/served at a shared resource. This
observation can serve as a general principle enabling the estimation of progress/slowdowns
at different shared resources.

• This dissertation presents the Memory Interference induced Slowdown Estimation (MISE)
model that accurately estimates application slowdowns in the presence of memory
interference as the ratio of uninterfered to interfered request service rates, based on the
correlation between request service rate and performance.

• This dissertation presents the Application Slowdown Model (ASM) that accurately estimates
application slowdowns due to both shared cache and main memory interference, by
minimizing interference at the main memory and quantifying interference at the shared cache.

• This dissertation builds several resource management mechanisms on top of MISE and ASM
that leverage their slowdown estimates to provide high performance, fairness and bounded
slowdowns, demonstrating MISE/ASM’s effectiveness in estimating slowdowns.
1.5 Dissertation Outline


This dissertation is organized into eight chapters. Chapter 2 presents background on memory
system organization and discusses related prior work on shared resource management and providing
Quality of Service (QoS). Chapter 3 presents the design of the Blacklisting memory scheduler
(BLISS) and evaluates it against state-of-the-art memory request schedulers. Chapter 4 presents
the Memory Interference induced Slowdown Estimation (MISE) model and its evaluation against
previous slowdown estimation techniques. Chapter 5 presents memory bandwidth management
schemes that leverage the MISE model to provide bounded application slowdowns and fairness.
Chapter 6 presents the Application Slowdown Model (ASM) and compares it against previous
schemes that estimate slowdown due to both shared cache and main memory interference.
Chapter 7 presents several use cases that leverage slowdown estimates from ASM to provide high
performance, fairness and bounded slowdowns. Finally, Chapter 8 presents conclusions and future
research directions that are enabled by this dissertation.
Chapter 2


Background and Related Prior Work 


The problem of shared resource interference has been a significant deterrent to achieving high
and controllable system performance. Not surprisingly, several previous works have attempted
to mitigate interference at both the shared caches and main memory, with the goal of improving
system performance. However, few previous works have tackled the problem of unpredictable
application slowdowns in the presence of shared resource interference.

In this chapter, we first provide a brief background on DRAM main memory organization
and then discuss previous proposals in different related areas, namely memory interference
mitigation, DRAM optimizations to improve system performance, shared cache capacity management,
Quality of Service (QoS) and slowdown estimation.


2.1 DRAM Main Memory Organization 


The DRAM main memory system is organized hierarchically as channels, ranks and banks, as
shown in Figure 2.1. Channels are independent and can operate completely in parallel. Each
channel consists of ranks (typically 1-4) that share the command and data bus of the channel.
A rank consists of multiple banks. The banks can operate in parallel. However, all banks within
a channel share the command and data bus of the channel. Each bank, in turn, is organized as an
array of rows and columns. On a data access, the entire row containing the data is brought into
an internal structure called the row-buffer. Therefore, a subsequent access to the same row can be
served from the row-buffer itself and need not access the array. This is called a row hit. On an
access to a different row, though, the array needs to be accessed. Such an access is called a row
miss. A row hit is served ~2x faster than a row miss [47]. Please refer to W2\ ITOl l66l for more
detail on DRAM operation.

Figure 2.1: DRAM main memory organization

Commonly employed memory controllers use a memory scheduling policy called First
Ready First Come First Served (FRFCFS) [129, 97] that leverages the row buffer by prioritizing
row hits over row misses/conflicts. Older requests are then prioritized over newer requests.
FRFCFS aims to maximize DRAM throughput by prioritizing row hits. However, it unfairly
prioritizes requests of applications that generate a large number of requests to the same row
(high-row-buffer-locality) and access memory frequently (high-memory-intensity) ([82118^ .
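The two-level priority order described above (row hits first, then older requests) can be sketched in a few lines of Python. This is a simplified model for illustration only: the `Request` type and the `open_rows` map are assumptions, and DRAM timing constraints that determine when a command is actually ready to issue are ignored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Request:
    arrival: int  # arrival time; smaller means older
    bank: int
    row: int


def frfcfs_pick(queue: List[Request], open_rows: Dict[int, int]) -> Optional[Request]:
    """Pick the next request: row hits before misses, then oldest first.

    open_rows maps a bank to the row currently held in its row-buffer.
    """
    if not queue:
        return None
    # The key is (is_row_miss, arrival); False sorts before True, so hits win.
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))
```

In this sketch, an application that keeps hitting in an open row is selected ahead of older requests from other applications to the same bank, which is the source of the unfairness noted above.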
2.2 Related Work on Memory Scheduling


Much prior work has focused on mitigating this unfairness and inter-application interference at the
main memory, with the goals of improving system performance and fairness. A predominant
solution direction is memory request scheduling. Several previous works [87l l83l [60l
El SB [Ml have proposed application-aware memory scheduling techniques that take into account
the memory access characteristics of applications and schedule requests appropriately in order to
mitigate inter-application interference and improve system performance and fairness.

Mutlu and Moscibroda propose PARBS [83], an application-aware memory scheduler that
batches the oldest requests from applications and prioritizes the batched requests, with the goals
of preventing starvation and improving fairness. Within each batch, PARBS ranks individual
applications based on the number of outstanding requests from the application and, using this total
rank order, prioritizes requests of applications that have low-memory-intensity to improve system
throughput. Kim et al. [60] observe that applications that receive low memory service tend to
experience interference from applications that receive high memory service. Based on this
observation, they propose ATLAS, an application-aware memory scheduler that ranks individual
applications based on the amount of long-term memory service each application receives and
prioritizes applications that receive low memory service, with the goal of improving overall system
throughput.
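To make the notion of a total rank order concrete, the following Python sketch ranks applications by the memory service they have attained, in the spirit of the ATLAS description above. The function name, the input dictionary, and the tie-breaking by application id are assumptions for illustration; how long-term service is accumulated is not modeled.

```python
from typing import Dict, List


def rank_by_attained_service(attained_service: Dict[int, float]) -> List[int]:
    """Return application ids from highest to lowest priority.

    Applications that have received the least memory service come first.
    Ties are broken by application id (an arbitrary choice for this sketch).
    """
    return sorted(attained_service, key=lambda app: (attained_service[app], app))


# Hypothetical long-term service values per application id.
order = rank_by_attained_service({0: 500.0, 1: 20.0, 2: 120.0})
```

Note that every application receives a distinct position in `order`; maintaining and comparing such positions in hardware for every scheduling decision is the kind of cost that ranking-based schedulers incur.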

Another recently proposed memory scheduling technique, Thread Cluster Memory scheduling
(TCM) lEII, ranks individual applications by memory intensity such that low-memory-intensity
applications are prioritized over high-memory-intensity applications (to improve system
throughput). Kim et al. [El also observed that ranking all applications based on memory intensity
and prioritizing low-memory-intensity applications could slow down the deprioritized
high-memory-intensity applications significantly and unfairly. This is because when all applications
are ranked by memory service, applications with high memory intensities are ranked lower, as they
inherently tend to have high memory service, as compared to other applications. With the goal of
mitigating this unfairness, TCM clusters applications into low- and high-memory-intensity clusters.
In the low-memory-intensity cluster, applications are ranked by memory intensity, whereas, in the
high-memory-intensity cluster, applications’ ranks are shuffled randomly to provide fairness. Both
clusters employ a total rank order among applications at any given time.
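The clustering-plus-ranking scheme described above can be sketched as follows. This is a loose illustration rather than TCM's actual mechanism: using a fixed intensity threshold to form the clusters, measuring intensity as misses per kilo-instruction (MPKI), and all names in the code are assumptions introduced here.

```python
import random
from typing import Dict, List, Optional


def tcm_like_rank_order(
    mpki: Dict[int, float],
    threshold: float,                      # assumed cluster boundary
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return application ids from highest to lowest priority."""
    rng = rng or random.Random()
    # Low-intensity cluster: ranked by memory intensity, least intensive first.
    low = sorted((a for a in mpki if mpki[a] < threshold), key=lambda a: (mpki[a], a))
    # High-intensity cluster: ranks are shuffled to spread the deprioritization.
    high = [a for a in mpki if mpki[a] >= threshold]
    rng.shuffle(high)
    # The low-intensity cluster is prioritized over the high-intensity cluster.
    return low + high
```

Calling the function again at a later time reshuffles only the high-intensity cluster, but at any given moment the result is still a total order over all applications.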

More recently, Ghose et al. [|^ propose a memory scheduler that aims to prioritize critical
memory requests that stall the instruction window for long periods of time. The scheduler predicts
the criticality of a load instruction based on how long it has stalled the instruction window in the
past (using the instruction address (PC)) and prioritizes requests from load instructions that have
large total and maximum stall times measured over a period of time. Although this scheduler is
not application-aware, we compare to it as it is the most recent scheduler that aims to maximize
performance by mitigating memory interference.

All these state-of-the-art schedulers incur significant hardware complexity and cost to rank
applications based on their memory access characteristics and prioritize requests based on this
ranking. This results in a significant increase in critical path latency and area, as we discuss in
Chapter 3.

2.3 Related Complementary Memory Scheduling Proposals 

Parallel Application Memory Scheduling (PAMS) [|^ tackles the problem of mitigating
interference between different threads of a multithreaded application, while Staged Memory
Scheduling (SMS) liTOl attempts to mitigate interference between the CPU and GPU in CPU-GPU
systems. Principles from our work can be employed in both of these contexts to identify and
deprioritize interference-causing threads, thereby mitigating interference experienced by vulnerable
threads/applications. Complexity effective memory access scheduling HI2411 attempts to achieve
the performance of FRFCFS using a First Come First Served scheduler in GPU systems, by
preventing row-buffer locality from being destroyed when data is transmitted over the on-chip
network. This proposal is complementary to our proposals and can be combined with our techniques
that prevent threads from hogging the row-buffer and banks. Ipek et al. iHTI propose a memory
controller design that employs machine learning techniques (reinforcement learning) to maximize
DRAM throughput. While such a policy could learn applications’ memory access characteristics
over time and appropriately optimize its scheduling policy to improve performance, implementing
machine learning techniques in the memory controller hardware could increase complexity.

Several previous works have tackled the problem of scheduling write back requests to memory.
Stuecheli et al. [107] and Lee et al. [64] propose to schedule write backs such that requests to the
same row are scheduled together to exploit row-buffer locality. Seshadri et al. [100] exploit their
proposed dirty-block index structure to identify dirty cache blocks from the same row, enabling
a simpler implementation of row-locality-aware write back. Zhao et al. [126] propose request
scheduling mechanisms to tackle the problem of heavy write traffic in persistent memory systems.
Our techniques can be combined with these different write handling mechanisms to achieve better
fairness and performance.

Previous works have also tackled the problem of memory management and request scheduling
in the presence of prefetch requests. Lee et al. Il6^ propose to dynamically prioritize/deprioritize
prefetch requests based on prefetcher accuracy. Lee et al. [65] also propose to schedule requests
accordingly to take advantage of the memory-level parallelism in the system, in the presence of
prefetch requests. Ebrahimi et al. Il2^ propose to incorporate prefetcher awareness, based on
monitoring prefetcher accuracy, into previously proposed fair memory schedulers such as PARBS.
These mechanisms can be combined with our proposals such as BLISS and the memory bandwidth
allocation policies we build on top of MISE and ASM, to incorporate prefetch-awareness.

2.4 Other Related Work on Memory Interference Mitigation 

While memory scheduling is a major solution direction towards mitigating interference, previous
works have also explored other approaches such as address interleaving (SSI, memory bank/channel
partitioning [l84l[5011711II221 ISTll, source throttling ll27llll3[[T3ll20ll^[90ll55ll and thread
scheduling II128[I1121I2T1I118II to mitigate interference.
Subrow Interleaving: Kaseridis et al. [53] propose minimalist open page, a data mapping policy
that interleaves data at the granularity of a sub-row across channels and banks such that applications
with high row-buffer locality are prevented from hogging the row buffer, while still preserving
some amount of row-buffer-locality.

Memory Channel/Bank Partitioning: Previous works (Hll |50l |7T1 11221 1571 propose techniques
to mitigate inter-application interference by partitioning channels/banks among applications such
that the data of interfering applications are mapped to different channels/banks.

Source Throttling: Source throttling techniques (e.g., [l27llll3[[T3ll20ll^l90l[55l[TTl l) propose
to throttle the memory request injection rates of interference-causing applications at the processor
core itself rather than regulating an application’s access behavior at the memory, unlike memory
scheduling, partitioning or interleaving. Other previous work by Ebrahimi et al. [26] proposes to
tune shared resource management policies such as FST [27] to be aware of prefetch requests.

OS Thread Scheduling: Previous works [128, 112, 118] propose to mitigate shared resource
contention by co-scheduling threads that interact well and interfere less at the shared resources.
Such a solution relies on the presence of enough threads with such symbiotic properties. Other
techniques [ITTI propose to map applications to cores to mitigate memory interference.

Our proposals to mitigate memory interference, with the goals of providing high performance
and fairness, can be combined with these solution approaches in a synergistic manner to achieve
better mitigation and, consequently, higher performance and fairness.


2.5 Related Work on DRAM Optimizations to Improve Performance

Several prior works have proposed optimizations to DRAM (internals) to enable more parallelism
within DRAM, thereby improving performance. Kim et al. [62] propose techniques to enable
access to multiple DRAM sub-arrays in parallel, thereby overlapping the latencies of these parallel
accesses. Lee et al. [66] observe that long bitlines contribute to high access latencies and
propose to split bitlines into two shorter segments (using an isolation transistor), enabling faster
access to one of the shorter segments. More recently, Lee et al. Wll propose to relax DRAM
timing constraints in order to optimize for performance in the common case. Multiple previous
works [I127[ r7ll^ have proposed to partition a DRAM rank, enabling parallel access to these
partitioned ranks. These techniques are complementary to memory interference mitigation techniques
and can be combined with them to achieve high performance benefits.


2.6 Related Work on Shared Cache Capacity Management 

The management of shared cache capacity among multiple contending applications is a much
explored area. A large body of previous research has focused on improving the shared cache
replacement policy [jSSl SB EB |99l . These proposals use different techniques to predict which
cache blocks would have high reuse and try to retain such blocks in the cache. Furthermore, some
of these proposals also attempt to retain at least part of the working set in the cache when an
application’s working set is much larger than the cache size. A number of cache insertion policies
have also been studied by previous proposals [I5nil01[[94lll21[|45]l . These policies use information
such as the memory region of an accessed address or the instruction pointer to predict the reuse
behavior of a missed cache block and insert blocks with higher reuse closer to the most recently
used position such that these blocks are not evicted immediately. Other previous works
[HB |B 110611191 l4B l59ll propose to partition the cache between applications such that applications
that have better utility for the cache are allocated more cache space. While these previous proposals
aim to improve system performance, they are not designed with the objective of providing
controllable performance.
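As an illustration of utility-driven partitioning in general, the Python sketch below hands out cache ways one at a time to whichever application gains the most hits from the next way. It is a simple greedy heuristic written for this example, not the algorithm of any specific cited proposal; the function name, the per-application hit curves, and the tie-breaking rule are all assumptions.

```python
from typing import Dict, List


def partition_ways(hit_curves: Dict[int, List[int]], total_ways: int) -> Dict[int, int]:
    """Greedily allocate cache ways by marginal utility.

    hit_curves[app][w] is the (hypothetical) number of hits the application
    would get with w ways, so index 0 corresponds to zero ways.
    """
    alloc = {app: 0 for app in hit_curves}

    def gain(app: int) -> int:
        # Extra hits from one more way; zero once the curve is exhausted.
        curve, w = hit_curves[app], alloc[app]
        return curve[w + 1] - curve[w] if w + 1 < len(curve) else 0

    for _ in range(total_ways):
        # Give the next way to the application with the largest gain;
        # ties go to the lowest application id.
        best = max(alloc, key=lambda a: (gain(a), -a))
        alloc[best] += 1
    return alloc
```

An application whose hit curve flattens quickly receives few ways, while one whose hits keep growing with capacity receives more, which maximizes total hits under this heuristic but gives no bound on any individual application's slowdown.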
2.7 Related Work on Coordinated Cache and Memory Management

While several previous works have proposed techniques to manage the shared cache capacity and
main memory bandwidth independently, there have been few previous works that have coordinated
the management of these resources. Bitirgen et al. liT4l propose a coordinated resource
management scheme that employs machine learning, specifically, an artificial neural network, to
predict each application’s performance for different possible resource allocations. Resources are
then allocated appropriately to different applications such that a global system performance metric
is optimized. More recently, Wang et al. [119] employ a market-dynamics-inspired mechanism to
coordinate allocation decisions across resources. We take a different and more general approach
and propose a model that accurately estimates application slowdowns. Our model can be used as
an effective substrate to build coordinated resource allocation policies that leverage our slowdown
estimates to achieve different goals such as high performance, fairness and controllable
performance.


2.8 Related Work on Cache and Memory QoS 

Several prior works have attempted to provide QoS guarantees in shared memory multicore 
systems. Previous works have proposed techniques to estimate applications’ sensitivity to 
interference and propensity to cause interference by profiling applications offline 
(e.g., [r77l[3ni29ll30ll). However, in several scenarios, such offline profiling of applications 
might not be feasible or accurate. For instance, in a cloud service, where any user can run a job 
using the available resources in a pay-as-you-go manner, profiling every application offline to 
gain a priori application knowledge can be prohibitive. In other cases, where the resource usage 
of an application is heavily input set dependent, the profile may not be representative. 
Mars et al. 1112311 also attempt to estimate applications’ sensitivity to interference and 
propensity to cause interference online. However, they assume that applications run by 
themselves at different points in time, allowing for such profiling, which might not necessarily 
be true for all applications and systems. Our techniques, on the other hand, strive to control 
and bound application slowdowns without relying on any offline profiling and are therefore 
more generally applicable to different systems and scenarios. 

Iyer et al. lEH |43l |44l| and Guo et al. [37] propose mechanisms to provide guarantees on shared 
cache space, memory bandwidth or IPC for different applications. Kasture and Sanchez Il5^ 
propose to partition shared caches with the goal of reducing the tail latency of latency-critical 
workloads. Nesbit et al. [89] propose a mechanism to enforce a memory bandwidth allocation 
policy that partitions the available memory bandwidth across concurrently running applications 
based on a given bandwidth allocation. Most of these policies aim to provide guarantees on 
resource allocation. Our goal, on the other hand, is to provide soft guarantees on application 
slowdowns. 

2.9 Related Work on Storage QoS 

A large body of previous work has tackled the challenge of providing QoS in the presence of 
contention between different applications for storage bandwidth. Several systems employ 
bandwidth-based throttling (e.g., [120, 52]) to ensure that some applications do not hog storage 
bandwidth, at the cost of degrading other applications’ performance. One such system, YFQ IIT^ 
controls the proportions of bandwidth different applications receive by assigning priority. Other 
systems such as SLEDS llT^ and Zygaria [120] employ a leaky bucket type model that controls 
the bandwidth of each workload, while provisioning for some burstiness. 

Other systems employ deadline-based throttling (e.g., BMl 1102117411 1) that attempts to provide 
latency guarantees for each request. RT-FS ifSTl uses the notion of slack to provide more resources 
to other applications. Cello [102] deals with two kinds of requests, ones that need to meet real-time 
latency requirements and others that do not need to meet such requirements. Cello tries to balance 
the needs of these two kinds of requests. Facade [74] tailors its latency guarantees depending on 
an application’s demand in terms of number of requests. More recent work such as Argon [116] 
takes into account that the system could be oversubscribed and determines feasibility of meeting 
utilization requirements and then seeks to provide guarantees in terms of utilization. 

While all these previous works are effective in providing different kinds of QoS at the storage, 
they do not take into account main memory bandwidth and shared cache capacity contention, which 
is the focus of our work. 


2.10 Related Work on Interconnect QoS 

Several previous works have tackled the problem of achieving QoS in the context of both off-chip 
and on-chip networks. Fair queueing [24] emulates round-robin service order among different 
flows. Virtual clock [125] provides a deadline-based scheme that effectively time-division 
multiplexes slots among different flows. While these approaches are rate-based, other previous 
works are frame-based. Time is divided into epochs or frames and different flows reserve slots 
within a frame. Some examples of frame-based policies are rotated combined queueing [58] and 
globally synchronized frames (Ml- Other previous work [105] proposes simple bandwidth allocation 
schemes that reduce the complexity of allocation in the intermediate router nodes. 

Grot et al. [36] propose the preemptive virtual clock mechanism that enables reclamation of 
idle resources, without adding significant buffer overhead. This mechanism preempts low-priority 
requests in order to provide better QoS to higher priority requests. Grot et al. also propose 
Kilo-NOC OSlI, an NoC architecture designed to be scalable to large systems. This proposal reduces 
the amount of hardware changes required at every node, achieving low router complexity. Das et 
al. [22] propose to employ stall time criticality information to distinguish between and prioritize 
different applications’ packets at routers. Das et al. also propose Aergia [|^ to further distinguish 
between packets of the same application, based on slack. 

Our work on cache and memory QoS can be combined with these previous works on interconnect 
QoS to achieve comprehensive and effective QoS at the system level. 


2.11 Related Work on Online Slowdown Estimation 

Eyerman and Eeckhout [l^ and Cazorla et al. lITTII propose mechanisms to determine an 
application’s slowdown while it is running alongside other applications on an SMT processor. 
Luque et al. [76] estimate application slowdowns in the presence of shared cache interference. 
Both these studies assume a fixed latency for accessing main memory, and hence do not take into 
account interference at the main memory. 

While a large body of previous work has focused on main memory and shared cache interference 
reduction techniques, few previous works have proposed techniques to estimate application 
slowdowns in the presence of main memory and cache interference. 

Li et al. [69] propose a scheme to estimate the impact of memory stall times on performance, 
for different applications, in the context of a hybrid memory system with DRAM and phase change 
memory (PCM). The goal of this work is to leverage this performance estimation scheme to map 
pages appropriately to DRAM and PCM with the goal of improving performance. Hence, this 
scheme does not focus much on very accurate performance estimation. 

Stall-Time Fair Memory Scheduling (STFM) [86] is one previous work that attempts to estimate 
each application’s slowdown induced by memory interference, with the goal of improving fairness 
by prioritizing the most slowed down application. STFM estimates an application’s slowdown as 
the ratio of its memory stall time when it is concurrently run alongside other applications to 
its memory stall time when it is run alone. 

Fairness via Source Throttling (FST) [27] and Per-thread cycle accounting (PTCA) [25] estimate 
application slowdowns due to both shared cache capacity and main memory bandwidth 
interference. They compute slowdown as the ratio of shared to alone execution time and estimate 
alone execution time by determining the number of cycles by which each request is delayed. 
Both FST and PTCA use a mechanism similar to STFM to quantify interference at the main 
memory. To quantify interference at the shared cache, both mechanisms determine which accesses of an 
application miss in the shared cache but would have been hits had the application been run alone 
on the system (contention misses), and compute the number of additional cycles taken to serve 
each contention miss. The main difference between FST and PTCA is in the mechanism they use 
to identify a contention miss. FST uses a pollution filter for each application that tracks the blocks 
of the application that were evicted by other applications. Any access that misses in the cache 
and hits in the pollution filter is considered a contention miss. On the other hand, PTCA uses an 
auxiliary tag store for each application that tracks the state of the cache had the application been 
running alone on the system. PTCA classifies any access that misses in the cache and hits in the 
auxiliary tag store as a contention miss. 

The challenge in all these approaches is in determining the alone stall time or execution time 
of an application while the application is actually running alongside other applications. STFM, 
FST and PTCA attempt to address this challenge by counting the number of cycles by which each 
individual request that stalls execution impacts execution time. This is fundamentally difficult 
and results in high inaccuracies in slowdown estimation, as we will describe in more detail in 
later chapters. 




Chapter 3 


Mitigating Memory Bandwidth Interference 
Towards Achieving High Performance 


The prevalent solution direction to tackle the problem of memory bandwidth interference is 
application-aware memory request scheduling, as we describe in an earlier chapter. State-of-the-art 
application-aware memory schedulers attempt to achieve two main goals: high system performance 
and high fairness. However, previous schedulers have two major shortcomings. First, these 
schedulers increase hardware complexity in order to achieve high system performance and 
fairness. Specifically, most of these schedulers rank individual applications with a total order, 
based on their memory access characteristics (e.g., [[871 Ell |60l IM1). Scheduling requests based 
on a total rank order incurs high hardware complexity, slowing down the memory scheduler 
significantly. For instance, the critical path latency for TCM increases by 8x (area increases by 
1.8x) compared to an application-unaware FRFCFS scheduler, as we demonstrate in Section 3.5.2. Such 
high critical path delays in the scheduler directly increase the time it takes to schedule a request, 
potentially making the memory controller latency a bottleneck. Second, a total-order ranking is 
unfair to applications at the bottom of the ranking stack. Even shuffling the ranks periodically (like 
TCM does) does not fully mitigate the unfairness and slowdowns experienced by an application 
when it is at the bottom of the ranking stack, as we describe in more detail in Section 3.1. 

Figure 3.1 compares four major previous schedulers using a three-dimensional plot with 
performance, fairness and simplicity on three different axes.^ On the fairness axis, we plot the 
negative of maximum slowdown, and on the simplicity axis, we plot the negative of critical path 
latency. Hence, the ideal scheduler would have high performance, fairness and simplicity, as 
indicated by the black triangle. As can be seen, previous ranking-based schedulers, PARBS, ATLAS 
and TCM, increase complexity significantly, compared to the currently employed FRFCFS scheduler, 
in order to achieve high performance and/or fairness. 


[Figure: three-dimensional plot; axes: Performance, Fairness (negative of maximum slowdown), 
Simplicity (negative of critical path latency)] 

Figure 3.1: Performance vs. fairness vs. simplicity 


Our goal, in this work, is to design a new memory scheduler that does not suffer from these 
shortcomings: one that achieves high system performance and fairness while incurring low 
hardware cost and complexity. To this end, we seek to overcome these shortcomings by exploring an 
alternative means to protecting vulnerable applications from interference and propose the 
Blacklisting memory scheduler (BLISS). 


^Results across 80 simulated workloads on a 24-core, 4-channel system. Section 3.4 describes our methodology 
and metrics. 

3.1 Key Observations 


We build our Blacklisting memory scheduler (BLISS) based on two key observations. 

Observation 1. Separating applications into only two groups (interference-causing and 
vulnerable-to-interference), without ranking individual applications using a total order, is sufficient 
to mitigate inter-application interference. This leads to higher performance, fairness and 
lower complexity, all at the same time. 


We observe that applications that are vulnerable to interference can be protected from 
interference-causing applications by simply separating them into two groups, one containing 
interference-causing applications and another containing vulnerable-to-interference applications, 
rather than ranking individual applications with a total order as many state-of-the-art schedulers 
do. To motivate this, we contrast TCM fl^, which clusters applications into two groups and 
employs a total rank order within each cluster, with a simple scheduling mechanism (Grouping) 
that groups applications into only two groups, based on memory intensity (as TCM does), 
and prioritizes the low-intensity group without employing ranking in each group. Grouping uses 
the FRFCFS policy within each group. Figure 3.2 shows the number of requests served during a 
100,000 cycle period at intervals of 1,000 cycles, for three representative applications, astar, 
hmmer and lbm from the SPEC CPU2006 benchmark suite BUl, using these two schedulers.^ These 
three applications are executed with other applications in a simulated 24-core 4-channel system.^ 

Figure 3.2 shows that TCM has high variance in the number of requests served across time, with 
very few requests being served during several intervals and many requests being served during a 
few intervals. This behavior is seen in most applications in the high-memory-intensity cluster since 
TCM ranks individual applications with a total order. This ranking causes some 
high-memory-intensity applications’ requests to be prioritized over other high-memory-intensity 
applications’ requests, at any point in time, resulting in high interference. Although TCM 
periodically shuffles this total-order ranking, we observe that an application benefits from 
ranking only during those periods when it is ranked very high. These very highly ranked periods 
correspond to the spikes in the number of requests served (for TCM) in Figure 3.2 for that 
application. During the other periods of time when an application is ranked lower (i.e., most of 
the shuffling intervals), only a small number of its requests are served, resulting in very slow 
progress. Therefore, most high-memory-intensity applications experience high slowdowns due to the 
total-order ranking employed by TCM. 

^All these three applications are in the high-memory-intensity group. We found very similar behavior in all other 
such applications we examined. 

^See Section 3.4 for our methodology. 



[Plots: requests served over Execution Time (in 1000s of cycles), 0 to 100, for (a) astar, (b) hmmer, (c) lbm] 

Figure 3.2: Request service distribution over time with TCM and Grouping schedulers 


On the other hand, when applications are separated into only two groups based on memory 
intensity and no per-application ranking is employed within a group, some interference exists 
among applications within each group (due to the application-unaware FRFCFS scheduling in 
each group). In the high-memory-intensity group, this interference contributes to the few 
low-request-service periods seen for Grouping in Figure 3.2. However, the request service behavior 
of Grouping is less spiky than that of TCM, resulting in lower memory stall times and a more steady 
and overall higher progress rate for high-memory-intensity applications, as compared to when 
applications are ranked in a total order. In the low-memory-intensity group, there is not much of 
a difference between TCM and Grouping, since applications have low memory intensities 
and hence do not cause significant interference to each other. Therefore, Grouping results in higher 
system performance and significantly higher fairness than TCM, as shown in Figure 3.3 (across 80 
24-core workloads on a simulated 4-channel system). 

Figure 3.3: Performance and fairness of Grouping vs. TCM 


Grouping applications into two groups also requires much lower hardware overhead than 
ranking-based schedulers that incur high overhead for computing and enforcing a total rank order 
for all applications. Therefore, grouping can not only achieve better system performance and 
fairness than ranking, but it also can do so while incurring lower hardware cost. However, 
classifying applications into two groups at coarse time granularities, on the order of a few 
million cycles, like TCM’s clustering mechanism does (and like what we have evaluated in 
Figure 3.3), can still cause unfair application slowdowns. This is because applications in one 
group would be deprioritized for a long time interval, which is especially dangerous if 
application behavior changes during the interval. Our second observation, which we describe next, 
minimizes such unfairness and at the same time reduces the complexity of grouping even further. 

Observation 2. Applications can be classified into interference-causing and vulnerable-to- 
interference groups by monitoring the number of consecutive requests served from each application 
at the memory controller. This leads to higher fairness and lower complexity, at the same time, than 
grouping schemes that rely on coarse-grained memory intensity measurement. 

Previous work actually attempted to perform grouping, along with ranking, to mitigate 
interference. Specifically, TCM [|M1 ranks applications by memory intensity and classifies 
applications that make up a certain fraction of the total memory bandwidth usage into a group 
called the low-memory-intensity cluster and the remaining applications into a second group called 
the high-memory-intensity cluster. While employing such a grouping scheme, without ranking 
individual applications, reduces hardware complexity and unfairness compared to a total order 
based ranking scheme (as we show in Figure 3.3), it i) can still cause unfair slowdowns due to 
classifying applications into groups at coarse time granularities, which is especially dangerous 
if application behavior changes during an interval, and ii) incurs additional hardware overhead 
and scheduling latency to compute and rank applications by long-term memory intensity and total 
memory bandwidth usage. 


We propose to perform application grouping using a significantly simpler, novel scheme: simply 
by counting the number of requests served from each application in a short time interval. 
Applications that have a large number (i.e., above a threshold value) of consecutive requests 
served are classified as interference-causing (this classification is periodically reset). The 
rationale behind this scheme is that when an application has a large number of consecutive 
requests served within a short time period, which is typical of applications with high memory 
intensity or high row-buffer locality, it delays other applications’ requests, thereby stalling 
their progress. Hence, identifying and essentially blacklisting such interference-causing 
applications by placing them in a separate group and deprioritizing requests of this blacklisted 
group can prevent such applications from hogging the memory bandwidth. As a result, the 
interference experienced by vulnerable applications is mitigated. The blacklisting classification 
is cleared periodically, at short time intervals (on the order of 1000s of cycles), so that an 
application is not deprioritized for so long that it suffers unfairness or starvation. Such 
clearing and re-evaluation of application classification at short time intervals significantly 
reduces unfair application slowdowns (as we quantitatively show in Section 3.5.7), while reducing 
complexity compared to tracking per-application metrics such as memory intensity. 


Summary of Key Observations. In summary, we make two key novel observations that lead 
to our design in Section 3.2. First, separating applications into only two groups can lead to a 
less complex, fairer and higher performance scheduler. Second, the two application groups can 
be formed seamlessly by monitoring the number of consecutive requests served from an application 
and deprioritizing the ones that have too many requests served in a short time interval. 

3.2 Mechanism 


The design of our Blacklisting scheduler (BLISS) is based on the two key observations described 
in the previous section. The basic idea behind BLISS is to observe the number of consecutive 
requests served from an application over a short time interval and blacklist applications that have 
a relatively large number of consecutive requests served. The blacklisted (interference-causing) 
and non-blacklisted (vulnerable-to-interference) applications are thus separated into two different 
groups. The memory scheduler then prioritizes the non-blacklisted group over the blacklisted 
group. The two main components of BLISS are i) the blacklisting mechanism and ii) the memory 
scheduling mechanism that schedules requests based on the blacklisting mechanism. We describe 
each in turn. 

3.2.1 The Blacklisting Mechanism 

The blacklisting mechanism needs to keep track of three quantities: 1) the application (i.e., 
hardware context) ID of the last scheduled request (Application ID),^ 2) the number of requests 
served from an application (#Requests Served), and 3) the blacklist status of each application. 

When the memory controller is about to issue a request, it compares the application ID of the 
request with the Application ID of the last scheduled request. 

• If the application IDs of the two requests are the same, the #Requests Served counter is 
incremented. 

• If the application IDs of the two requests are not the same, the #Requests Served counter is 
reset to zero and the Application ID register is updated with the application ID of the request 
that is being issued. 

^An application here denotes a hardware context. There can be as many applications executing actively as there are 
hardware contexts. Multiple hardware contexts belonging to the same application are considered separate applications 
by our mechanism, but our mechanism can be extended to deal with such multithreaded applications. 

If the #Requests Served exceeds a Blacklisting Threshold (4 in most of our evaluations): 

• The application with ID Application ID is blacklisted (classified as interference-causing). 

• The #Requests Served counter is reset to zero. 

The blacklist information is cleared periodically after every Clearing Interval (set to 10000 
cycles in our major evaluations). 
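
The blacklisting steps above can be sketched in software as follows. This is a minimal illustrative model, not the hardware implementation evaluated in this chapter: the class name `Blacklister` and the method names `on_issue` and `on_cycle` are hypothetical, and the defaults (threshold of 4, clearing interval of 10000 cycles) are the values quoted in the text.

```python
class Blacklister:
    """Hypothetical software model of the blacklisting logic described above."""

    def __init__(self, num_apps, threshold=4, clearing_interval=10_000):
        self.threshold = threshold                  # Blacklisting Threshold
        self.clearing_interval = clearing_interval  # Clearing Interval (cycles)
        self.last_app_id = None                     # Application ID register
        self.requests_served = 0                    # #Requests Served counter
        self.blacklist = [False] * num_apps         # one blacklist bit per application

    def on_issue(self, app_id):
        """Update state when the controller is about to issue a request."""
        if app_id == self.last_app_id:
            # Same application as the last scheduled request: extend the streak.
            self.requests_served += 1
        else:
            # Different application: reset the counter and remember the new ID.
            self.requests_served = 0
            self.last_app_id = app_id
        if self.requests_served > self.threshold:
            # Streak exceeded the threshold: mark the application as interference-causing.
            self.blacklist[app_id] = True
            self.requests_served = 0

    def on_cycle(self, cycle):
        """Clear all blacklist bits at every clearing-interval boundary."""
        if cycle > 0 and cycle % self.clearing_interval == 0:
            self.blacklist = [False] * len(self.blacklist)
```

Note that only one counter and one ID register are shared by all applications in this sketch, which mirrors the single Application ID register and single #Requests Served counter listed in the storage cost discussion.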

3.2.2 Blacklist-Based Memory Scheduling 

Once the blacklist information is computed, it is used to determine the scheduling priority of a 
request. Memory requests are prioritized in the following order: 

1. Non-blacklisted applications’ requests 

2. Row-buffer hit requests 

3. Older requests 

Prioritizing requests of non-blacklisted applications over requests of blacklisted applications 
mitigates interference. Row-buffer hits are then prioritized to optimize DRAM bandwidth utilization. 
Finally, older requests are prioritized over younger requests for forward progress. 
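
The three-level priority order above can be expressed as a lexicographic comparison. The sketch below is only an illustration under stated assumptions: the `Request` fields and the `pick_next` helper are hypothetical names, and a real controller would evaluate this order in hardware over the request queue and subject to DRAM timing constraints, which are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app_id: int          # hardware context that issued the request
    row_hit: bool        # True if the request hits in the open row buffer
    arrival_cycle: int   # smaller value means an older request


def pick_next(requests, blacklist):
    """Return the highest-priority request under the blacklist-based order."""
    # Tuples compare element by element and False sorts before True, so the key
    # prefers: 1) non-blacklisted applications, 2) row-buffer hits, 3) older requests.
    return min(
        requests,
        key=lambda r: (blacklist[r.app_id], not r.row_hit, r.arrival_cycle),
    )
```

With an all-false blacklist, the key reduces to row-hit-first, then oldest-first, which is the FRFCFS order that BLISS builds on.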

3.3 Implementation 

The Blacklisting memory scheduler requires additional storage (flip flops) and logic over an 
FRFCFS scheduler to 1) perform blacklisting and 2) prioritize non-blacklisted applications’ requests. 
We analyze its storage and logic cost below. 

3.3.1 Storage Cost 


In order to perform blacklisting, the memory scheduler needs the following storage components: 

• one register to store Application ID (5 bits for 24 applications) 

• one counter for #Requests Served (8 bits is more than sufficient for the values of the request 
count threshold that we observe achieve high performance and fairness) 

• one register to store the Blacklisting Threshold that determines when an application should 
be blacklisted 

• a blacklist bit vector to indicate the blacklist status of each application (one bit for each 
hardware context, i.e., 24 bits for 24 applications) 

In order to prioritize non-blacklisted applications’ requests, the memory controller needs to 
store the application ID (hardware context ID) of each request so it can determine the blacklist 
status of the application and appropriately schedule the request. 
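
As a rough worked example, the storage components listed above can be added up for the 24-application configuration. The sketch below rests on assumptions that the text does not state: the width of the Blacklisting Threshold register is taken to be 8 bits (matching the counter), the per-request ID tags are counted for a single 128-entry request queue as in Table 3.1, and the function name `bliss_storage_bits` is hypothetical.

```python
def bliss_storage_bits(num_apps=24, counter_bits=8, threshold_bits=8, queue_entries=128):
    """Estimate BLISS storage in bits under the assumptions stated above."""
    # Bits needed to encode an application ID: 5 bits for 24 applications.
    app_id_bits = (num_apps - 1).bit_length()
    # Application ID register + #Requests Served counter
    # + Blacklisting Threshold register (assumed width) + blacklist bit vector.
    scheduler_state = app_id_bits + counter_bits + threshold_bits + num_apps
    # One application ID tag per request queue entry (assumed single queue).
    per_request_tags = app_id_bits * queue_entries
    return app_id_bits, scheduler_state, per_request_tags
```

Under these assumptions the blacklisting state itself is a few tens of bits, and the per-request ID tags dominate the storage. These are back-of-the-envelope figures for intuition only; the synthesized area results are the ones reported in Section 3.5.2.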


3.3.2 Logic Cost 

The memory scheduler requires comparison logic to 

• determine when an application’s #Requests Served exceeds the Blacklisting Threshold and 
set the bit corresponding to the application in the Blacklist bit vector. 

• prioritize non-blacklisted applications’ requests. 


We provide a detailed quantitative evaluation of the hardware area cost and logic latency of 
implementing BLISS and previously proposed memory schedulers in Section 3.5.2. 

3.4 Methodology 


3.4.1 System Configuration 


We model the DRAM memory system using a cycle-level in-house DDR3-SDRAM simulator. The 
simulator was validated against Micron’s behavioral Verilog model f^Ol and DRAMSim2 [98]. 
This DDR3 simulator is integrated with a cycle-level in-house simulator that models out-of-order 
execution cores, driven by a Pin [73] tool at the frontend. Each core has a private cache of 512 
KB size. We present most of our results on a system with the DRAM main memory as the only 
shared resource in order to isolate the effects of memory bandwidth interference on application 
performance. We also present results with shared caches in Section 3.5.11. Table 3.1 provides 
more details of our simulated system. We perform most of our studies on a system with 24 cores 
and 4 channels. We provide a sensitivity analysis for a wide range of core and channel counts in 
Section 3.5.11. Each channel has one rank and each rank has eight banks. We stripe data across 
channels and banks at the granularity of a row. 


Processor           16-64 cores, 5.3GHz, 3-wide issue, 
                    8 MSHRs, 128-entry instruction window 

Last-level cache    64B cache-line, 16-way associative, 
                    512KB private cache-slice per core 

Memory controller   128-entry read/write request queue per controller 

Memory              Timing: DDR3-1066 (8-8-8) (79) 
                    Organization: 1-8 channels, 1 rank-per-channel, 
                    8 banks-per-rank, 8 KB row-buffer 


Table 3.1: Configuration of the simulated system 


3.4.2 Workloads 

We perform our main studies using 24-core multiprogrammed workloads made of applications 
from the SPEC CPU2006 suite 0, TPC-C, Matlab and the NAS parallel benchmark suite [5].^ We 
classify a benchmark as memory-intensive if it has a Misses Per Kilo Instruction (MPKI) greater 
than 5 and memory-non-intensive otherwise. We construct four categories of workloads (with 20 
workloads in each category), with 25, 50, 75 and 100 percent of memory-intensive applications. 
This makes up a total of 80 workloads with a range of memory intensities, constructed using 
random combinations of benchmarks, modeling a cloud-computing-like scenario where workloads 
of various types are consolidated on the same node to improve efficiency. We also evaluate 16-, 
32- and 64-core workloads, with different memory intensities, created using a similar 
methodology as described above for the 24-core workloads. We simulate each workload for 100 
million representative cycles, as done by previous studies in memory scheduling [fSTlIbOll^. 

^Each benchmark is single threaded. 
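
The workload construction described above can be sketched as follows. This is an illustration only: the benchmark names and MPKI values in `example_mpki` are placeholders rather than measured data, the helper names are hypothetical, and sampling with replacement is an assumption, since the text says only that random combinations of benchmarks are used.

```python
import random


def is_memory_intensive(mpki, threshold=5.0):
    """A benchmark is memory-intensive if its MPKI is greater than 5."""
    return mpki > threshold


def build_workload(mpki_by_benchmark, num_cores, intensive_fraction, rng):
    """Pick one benchmark per core with the requested memory-intensive share."""
    intensive = sorted(b for b, m in mpki_by_benchmark.items() if is_memory_intensive(m))
    non_intensive = sorted(b for b, m in mpki_by_benchmark.items() if not is_memory_intensive(m))
    num_intensive = round(num_cores * intensive_fraction)
    # Assumption: benchmarks are drawn with replacement within each class.
    return ([rng.choice(intensive) for _ in range(num_intensive)]
            + [rng.choice(non_intensive) for _ in range(num_cores - num_intensive)])


# Placeholder benchmark names and MPKI values, for illustration only.
example_mpki = {"bench_a": 30.0, "bench_b": 12.5, "bench_c": 0.4, "bench_d": 2.0}

# Four categories (25, 50, 75 and 100 percent memory-intensive), 20 workloads each.
rng = random.Random(0)
workloads = [build_workload(example_mpki, 24, fraction, rng)
             for fraction in (0.25, 0.50, 0.75, 1.00)
             for _ in range(20)]
```

This yields 80 workloads of 24 applications each, matching the counts given in the text.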

3.4.3 Metrics 

We quantitatively compare BLISS with previous memory schedulers in terms of system 
performance, fairness and complexity. We use the weighted speedup [|2^ |32l 1 10411 metric to measure 
system performance. We use the maximum slowdown metric ^T2\ l60l [Ml I115II to measure 
unfairness. We report the harmonic speedup metric [75] as another measure of system performance. 
The harmonic speedup metric also serves as a measure of balance between system performance 
and fairness [75]. We report area in square micrometers (um^2) and scheduler critical path 
latency in nanoseconds (ns) as measures of complexity. 
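
For reference, the three performance and fairness metrics can be computed from per-application IPC values as sketched below. The formulas are the commonly used definitions from the cited literature, which this section does not restate, so treat them as an assumption about the exact form used here; the function and argument names are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of each application's IPC when sharing relative to running alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def harmonic_speedup(ipc_shared, ipc_alone):
    """Number of applications divided by the sum of per-application slowdowns."""
    return len(ipc_shared) / sum(a / s for s, a in zip(ipc_shared, ipc_alone))


def maximum_slowdown(ipc_shared, ipc_alone):
    """Largest per-application slowdown; higher values mean more unfairness."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))
```

For instance, two applications that each run at half of their alone IPC give a weighted speedup of 1.0, a harmonic speedup of 0.5 and a maximum slowdown of 2.0.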

3.4.4 RTL Synthesis Methodology 

In order to obtain timing/area results for BLISS and previous schedulers, we implement them in 
Register Transfer Level (RTL), using Verilog. We synthesize the RTL implementations with a 
commercial 32 nm standard cell library, using the Design Compiler tool from Synopsys. 

3.4.5 Mechanism Parameters 


For BLISS, we use a value of four for Blacklisting Threshold, and a value of 10000 cycles for 
Clearing Interval. These values provide a good balance between performance and fairness, as we 
observe from our sensitivity studies in Section 3.5.12. For the other schedulers, we tuned their 
parameters to achieve high performance and fairness on our system configurations and workloads. 
We use a Marking-Cap of 5 for PARBS, cap of 4 for FRFCFS-Cap, HistoryWeight of 0.875 for 
ATLAS, ClusterThresh of 0.2 and ShuffleInterval of 1000 cycles for TCM. 


3.5 Evaluation 


We compare BLISS with five previously proposed memory schedulers, FRFCFS, FRFCFS with a 
cap (FRFCFS-Cap) MB, PARBS, ATLAS and TCM. FRFCFS-Cap is a modified version of FRFCFS 
that caps the number of consecutive row-buffer hitting requests that can be served from an 
application [86]. Figure 3.4 shows the average system performance (weighted speedup and 
harmonic speedup) and unfairness (maximum slowdown) across all our workloads. Figure 3.5 shows 
a pareto plot of weighted speedup and maximum slowdown. We draw three major observations. 
First, BLISS achieves 5% better weighted speedup, 25% lower maximum slowdown and 19% better 
harmonic speedup than the best performing previous scheduler (in terms of weighted speedup), 
TCM, while reducing the critical path and area by 79% and 43% respectively (as we will show in 
Section 3.5.2). Therefore, we conclude that BLISS achieves both high system performance and 
fairness, at low hardware cost and complexity. 


Second, BLISS significantly outperforms all these five previous schedulers in terms of system 
performance; however, it has 10% higher unfairness than PARBS, the previous scheduler with the 
least unfairness. PARBS creates request batches containing the oldest requests from each 
application. Older batches are prioritized over newer batches. However, within each batch, 
individual applications’ requests are ranked and prioritized based on memory intensity. PARBS 
aims to preserve fairness by batching older requests, while still employing ranking within a 
batch to prioritize low-memory-intensity applications. We observe that the batching aspect of 
PARBS is quite effective in mitigating unfairness, although it increases complexity. This 
unfairness reduction also contributes to the high harmonic speedup of PARBS. However, batching 
restricts the amount of request reordering that can be achieved through ranking. Hence, 
low-memory-intensity applications that would benefit from prioritization via aggressive request 
reordering have lower performance. As a result, PARBS has 8% lower weighted speedup than BLISS. 
Furthermore, PARBS has a 6.5x longer critical path and greater area than BLISS, as we will show 
in Section 3.5.2. Therefore, we conclude that BLISS achieves better system performance than 
PARBS, at much lower hardware cost, while slightly trading off fairness. 

Figure 3.4: System performance and fairness of BLISS compared to previous schedulers 


Figure 3.5: Pareto plot of system performance and fairness 
(x-axis: system performance (weighted speedup); schedulers plotted: FRFCFS, FRFCFS-Cap, PARBS, 
ATLAS, TCM, Blacklisting) 


Third, BLISS has 4% higher unfairness than FRFCFS-Cap, but it also has 8% higher performance 
than FRFCFS-Cap. FRFCFS-Cap has higher fairness than BLISS since it restricts the length of 
only the ongoing row hit streak, whereas blacklisting an application can deprioritize the 
application for a longer time, until the next clearing interval. As a result, FRFCFS-Cap slows 
down high-row-buffer-locality applications to a lower degree than BLISS. However, restricting 
only the ongoing streak rather than blacklisting an interfering application for a longer time 
causes more interference to other applications, degrading system performance compared to BLISS. 
Furthermore, FRFCFS-Cap is unable to mitigate interference due to applications with high memory 
intensity yet low row-buffer locality, whereas BLISS is effective in mitigating interference due 
to such applications as well. Hence, we conclude that BLISS achieves higher performance 
(weighted speedup) than FRFCFS-Cap, while slightly trading off fairness. 

3.5.1 Analysis of Individual Workloads 

In this section, we analyze the performance and fairness for individual workloads, when 
employing different schedulers. Figure 3.6 shows the performance and fairness normalized to the 
baseline FRFCFS scheduler for all our 80 workloads, for BLISS and previous schedulers, in the 
form of S-curves [101]. The workloads are sorted based on the performance improvement of BLISS. 
We draw three major observations. First, BLISS achieves the best performance among all 
schedulers for most of our workloads. For a few workloads, ATLAS achieves higher performance, 
by virtue of always prioritizing applications that receive low memory service. However, always 
prioritizing applications that receive low memory service can unfairly slow down applications 
with high memory intensities, thereby degrading fairness significantly (as shown in the maximum 
slowdown plot, Figure 3.6 bottom). Second, BLISS achieves significantly higher fairness than 
ATLAS and TCM, the best-performing previous schedulers, while also achieving higher performance 
than them, and approaches the fairness of the fairest previous schedulers, PARBS and FRFCFS-Cap. 
As described in the analysis of average performance and fairness results above, PARBS, by 
virtue of request batching, and FRFCFS-Cap, by virtue of restricting only the current row hit 
streak, achieve higher fairness (lower maximum slowdown) than BLISS for a number of workloads. 
However, these schedulers achieve higher fairness at the cost of lower system performance, as 
shown in Figure 3.6. Third, for some workloads with very high memory intensities, the default 
FRFCFS scheduler achieves the best fairness. This is because memory bandwidth becomes a very 
scarce resource when the memory intensity of a workload is very high. Hence, prioritizing row 
hits utilizes memory bandwidth efficiently for such workloads, thereby resulting in higher 
fairness. Based on these observations, we conclude that BLISS achieves the best performance and 
a good trade-off between fairness and performance for most of the workloads we examine. 

Figure 3.6: System performance and fairness for all workloads 


3.5.2 Hardware Complexity 

Figures 3.7 and 3.8 show the critical path latency and area of five previous schedulers and 
BLISS for a 24-core system for every memory channel. We draw two major conclusions. First, 
previously proposed ranking-based schedulers, PARBS/ATLAS/TCM, greatly increase the critical 
path latency and area of the memory scheduler: by 11x/5.3x/8.1x and 2.4x/1.7x/1.8x respectively, 
compared to FRFCFS and FRFCFS-Cap, whereas BLISS increases latency and area by only 1.7x and 
3.2% over FRFCFS/FRFCFS-Cap.^ Second, PARBS, ATLAS and TCM cannot meet the stringent worst-case 
timing requirements posed by the DDR3 and DDR4 standards [48, 49]. In the case where every 
request is a row-buffer hit, the memory controller would have to schedule a request every 
read-to-read cycle time (tCCD), the minimum value of which is 4 cycles for both DDR3 and DDR4. 
TCM and ATLAS can meet this worst-case timing only until DDR3-800 (read-to-read cycle time of 
10 ns) and DDR3-1333 (read-to-read cycle time of 6 ns) respectively, whereas BLISS can meet the 
worst-case timing all the way down to the highest released frequency for DDR4, DDR4-3200 
(read-to-read time of 2.5 ns). Hence, the high critical path latency of PARBS, ATLAS and TCM is 
a serious impediment to their adoption in today’s and future memory interfaces. Techniques like 
pipelining could potentially be employed to reduce the critical path latency of these previous 
schedulers. However, the additional flops required for pipelining would increase area, power 
and design effort significantly. Therefore, we conclude that BLISS, with its much lower 
complexity and cost as well as higher system performance and competitive or better fairness, is 
a more effective alternative to state-of-the-art application-aware memory schedulers. 

^The area numbers are for the lowest value of critical path latency that the scheduler is able to meet. 
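
The read-to-read times quoted above follow from the 4-cycle minimum tCCD and the interface 
clock. As a quick check, assuming the usual DDR naming in which the number after the dash is 
the data rate in mega-transfers per second and the clock runs at half that rate (the helper 
name is hypothetical): 

```python
def read_to_read_ns(data_rate_mtps, tccd_cycles=4):
    """Time budget for scheduling one request when every request is a row hit."""
    clock_mhz = data_rate_mtps / 2.0   # double data rate: two transfers per clock
    cycle_ns = 1000.0 / clock_mhz      # clock period in nanoseconds
    return tccd_cycles * cycle_ns


for name, rate in [("DDR3-800", 800), ("DDR3-1333", 1333), ("DDR4-3200", 3200)]:
    print(name, round(read_to_read_ns(rate), 1), "ns")
```

The scheduler's critical path has to fit within this budget, which shrinks by 4x between 
DDR3-800 and DDR4-3200. 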

Figure 3.7: Critical path: BLISS vs. previous schedulers 

Figure 3.8: Area: BLISS vs. previous schedulers 




3.5.3 Trade-offs Between Performance, Fairness and Complexity 

In the previous sections, we studied the performance, fairness and complexity of different 
schedulers individually. In this section, we will analyze the trade-offs between these metrics 
for different schedulers. Figure 3.9 shows a three-dimensional radar plot with performance, 
fairness and simplicity on three different axes. On the fairness axis, we plot the negative of 
the maximum slowdown numbers, and on the simplicity axis, we plot the negative of the critical 
path latency numbers. Hence, the ideal scheduler would have high performance, fairness and 
simplicity, as indicated by the encompassing, dashed black triangle. 

Figure 3.9: Performance, fairness and simplicity trade-offs 


We draw three major conclusions about the different schedulers we study. First, 
application-unaware schedulers, such as FRFCFS and FRFCFS-Cap, are simple. However, they have 
low performance and/or fairness. This is because, as described in our performance analysis 
above, FRFCFS allows long streaks of row hits from one application to cause interference to 
other applications. FRFCFS-Cap attempts to tackle this problem by restricting the length of the 
current row hit streak. While such a scheme improves fairness, it still does not improve 
performance significantly. Second, application-aware schedulers, such as PARBS, ATLAS and TCM, 
improve performance or fairness by ranking based on applications’ memory access 
characteristics. However, they do so at the cost of increasing complexity (reducing simplicity) 
significantly, since they employ a full ordered ranking across all applications. Third, BLISS 
achieves high performance and fairness, while keeping the design simple, thereby approaching 
the ideal scheduler design (i.e., leading to a triangle that is closer to the ideal triangle). 
This is because BLISS requires only simple hardware changes to the memory controller to 
blacklist applications that have long streaks of requests served, which effectively mitigates 
interference. Therefore, we conclude that BLISS achieves the best trade-off between 
performance, fairness and simplicity. 


3.5.4 Understanding the Benefits of BLISS 


We present the distribution of the number of consecutive requests served (streaks) from 
individual applications to better understand why BLISS effectively mitigates interference. 
Figure 3.10 shows the distribution of requests served across different streak lengths ranging 
from 1 to 16 for FRFCFS, PARBS, TCM and BLISS for six representative applications from the same 
24-core workload.^ The figure captions indicate the memory intensity, in misses per kilo 
instruction (MPKI), and row-buffer hit rate (RBH) of each application when it is run alone. 

Figure 3.10: Distribution of streak lengths 
(surviving panel caption: (f) cactusADM (MPKI: 7; RBH: 49%)) 

^A value of 16 captures streak lengths 16 and above. 

Figures 3.10a, 3.10b and 3.10c show the streak length distributions of applications that have a 
tendency to cause interference (libquantum, mcf and lbm). All these applications have high 
memory intensity and/or high row-buffer locality. Figures 3.10d, 3.10e and 3.10f show 
applications that are vulnerable to interference (calculix, cactusADM and sphinx3). These 
applications have lower memory intensities and row-buffer localities, compared to the 
interference-causing applications. We observe that BLISS shifts the distribution of streak 
lengths towards the left for the interference-causing applications, while it shifts the streak 
length distribution to the right for the interference-prone applications. Hence, BLISS breaks 
long streaks of consecutive requests for interference-causing applications, while enabling 
longer streaks for vulnerable applications. This enables such vulnerable applications to make 
faster progress, thereby resulting in better system performance and fairness. We have observed 
similar results for most of our workloads. 
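
A streak-length distribution of this kind can be derived from the order in which a memory 
channel serves requests. The following is a minimal sketch, assuming the service order is 
available as a sequence of application identifiers and that the distribution is weighted by 
requests (each streak contributes its full length to its bucket). The function name and input 
format are hypothetical. 

```python
from itertools import groupby


def streak_distribution(served_apps, cap=16):
    """Per application, the fraction of its requests served in streaks of each length.

    Streaks of length `cap` or more share the `cap` bucket.
    """
    totals = {}    # application -> total requests served
    buckets = {}   # application -> {streak length bucket -> requests}
    for app, run in groupby(served_apps):
        length = sum(1 for _ in run)       # consecutive requests from this application
        bucket = min(length, cap)
        totals[app] = totals.get(app, 0) + length
        per_app = buckets.setdefault(app, {})
        per_app[bucket] = per_app.get(bucket, 0) + length
    return {
        app: {b: n / totals[app] for b, n in per_app.items()}
        for app, per_app in buckets.items()
    }


# Application "A" is served in streaks of 3 and 1, "B" in streaks of 1 and 2.
print(streak_distribution(list("AAABABB")))
```

A scheduler that breaks up long streaks moves weight towards the small buckets for the 
affected application, which is the leftward shift described above. 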


3.5.5 Average Request Latency 

In this section, we evaluate the average memory request latency (from when a request is 
generated until when it is served) metric and seek to understand its correlation with 
performance and fairness. Figure 3.11 presents the average memory request latency for the five 
previously proposed memory schedulers and BLISS. Three major observations are in order. First, 
FRFCFS has the lowest average request latency among all the schedulers. This is expected since 
FRFCFS maximizes DRAM throughput by prioritizing row-buffer hits. Hence, the number of requests 
served is maximized overall (across all applications). However, maximizing throughput (i.e., 
minimizing overall average request latency) degrades the performance of low-memory-intensity 
applications, since these applications’ requests are often delayed behind row-buffer hits and 
older requests. This results in degradation in system performance and fairness, as shown in 
Figure 3.4. 

Second, ATLAS and TCM, memory schedulers that prioritize requests of low-memory-intensity 
applications by employing a full ordered ranking, achieve relatively low average latency. This 
is because these schedulers reduce the latency of serving requests from latency-critical, 
low-memory-intensity applications significantly. Furthermore, prioritizing 
low-memory-intensity applications’ requests does not increase the latency of 
high-memory-intensity applications significantly. This is because high-memory-intensity 
applications already have high memory access latencies (even when run alone) due to queueing 
delays. Hence, average request latency does not increase much from deprioritizing requests of 
such applications. However, always prioritizing such latency-critical applications results in 
lower memory throughput for high-memory-intensity applications, resulting in unfair slowdowns 
(as we show in Figure 3.4). Third, memory schedulers that provide the best fairness, PARBS, 
FRFCFS-Cap and BLISS, have high average memory latencies. This is because these schedulers, 
while employing techniques to prevent requests of vulnerable applications with low memory 
intensity and low row-buffer locality from being delayed, also avoid unfairly delaying requests 
of high-memory-intensity applications. As a result, they do not reduce the request service 
latency of low-memory-intensity applications significantly, at the cost of denying memory 
throughput to high-memory-intensity applications, unlike ATLAS or TCM. Based on these 
observations, we conclude that while some applications benefit from low memory access 
latencies, other applications benefit more from higher memory throughput than lower latency. 
Hence, average memory latency is not a suitable metric to estimate system performance or 
fairness. 

Figure 3.11: The Average Request Latency Metric 


3.5.6 Impact of Clearing the Blacklist Asynchronously 


The Blacklisting scheduler design we have presented and evaluated so far clears the blacklisting 
information periodically (every 10000 cycles in our evaluations so far), such that all applications 
are removed from the blacklist at the end of a Clearing Interval. In this section, we evaluate an 
alternative design where an individual application is removed from the blacklist Clearing Interval 
cycles after it has been blacklisted (independent of the other applications). In order to implement 
this alternative design, each application would need an additional counter to keep track of the num¬ 
ber of remaining cycles until the application would be removed from the blacklist. This counter 
is set (to the Clearing Interval) when an application is blacklisted and is decremented every cycle 
until it becomes zero. When it becomes zero, the corresponding application is removed from the 
blacklist. We use a Clearing Interval of 10000 cycles for this alternative design as well. 
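
The difference between the two clearing policies can be sketched in software as follows. This 
is an illustrative model only, not the hardware design: the class and method names are 
hypothetical, one call to `tick()` stands for one cycle, and the choice to reset the countdown 
when an already blacklisted application is blacklisted again is an assumption that the text 
does not specify. 

```python
class PeriodicBlacklist:
    """Original design: every application is cleared at the end of each interval."""

    def __init__(self, num_apps, clearing_interval=10000):
        self.blacklisted = [False] * num_apps
        self.interval = clearing_interval
        self.cycle = 0

    def blacklist(self, app):
        self.blacklisted[app] = True

    def is_blacklisted(self, app):
        return self.blacklisted[app]

    def tick(self):
        self.cycle += 1
        if self.cycle % self.interval == 0:
            # One shared clearing event, no per-application state needed.
            self.blacklisted = [False] * len(self.blacklisted)


class IndividualClearingBlacklist:
    """Alternative design: one countdown counter per application."""

    def __init__(self, num_apps, clearing_interval=10000):
        self.remaining = [0] * num_apps   # cycles left on the blacklist
        self.interval = clearing_interval

    def blacklist(self, app):
        self.remaining[app] = self.interval   # counter is set when blacklisted

    def is_blacklisted(self, app):
        return self.remaining[app] > 0

    def tick(self):
        # Every counter is decremented each cycle until it reaches zero.
        self.remaining = [max(0, r - 1) for r in self.remaining]
```

In the periodic design, an application blacklisted late in an interval is cleared after only a 
few cycles, whereas the per-application counters always keep it blacklisted for a full Clearing 
Interval, at the cost of one extra counter per application. 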


Table 3.2 shows the system performance and fairness of the original BLISS design (BLISS) 
and the alternative design in which individual applications are removed from the blacklist asyn¬ 
chronously (BLISS-Individual-Clearing). As can be seen, the performance and fairness of the two 
designs are similar. Furthermore, the first design (BLISS) is simpler since it does not need to 
maintain an additional counter for each application. We conclude that the original BLISS design 
is more efficient, in terms of performance, fairness and complexity. 


Metric               BLISS    BLISS-Individual-Clearing 
Weighted Speedup     9.18     9.12 
Maximum Slowdown     6.54     6.60 

Table 3.2: Clearing the blacklist asynchronously 


3.5.7 Comparison with TCM’s Clustering Mechanism 


Figure 3.12 shows the system performance and fairness of BLISS, TCM and TCM’s clustering 
mechanism (TCM-Cluster). TCM-Cluster is a modified version of TCM that performs clustering, 
but does not rank applications within each cluster. We draw two major conclusions. First, 
TCM-Cluster has similar system performance as BLISS, since both BLISS and TCM-Cluster 
prioritize vulnerable applications by separating them into a group and prioritizing that group 
rather than ranking individual applications. Second, TCM-Cluster has significantly higher 
unfairness compared to BLISS. This is because TCM-Cluster always deprioritizes 
high-memory-intensity applications, regardless of whether or not they are causing interference 
(as described earlier). BLISS, on the other hand, observes an application at fine time 
granularities, independently at every memory channel, and blacklists an application at a 
channel only when it is generating a number of consecutive requests (i.e., potentially causing 
interference to other applications). 

Figure 3.12: Comparison with TCM’s clustering mechanism 


3.5.8 Evaluation of Row Hit Based Blacklisting 


BLISS, by virtue of restricting the number of consecutive requests that are served from an 
application, attempts to mitigate the interference caused by both high-memory-intensity and 
high-row-buffer-locality applications. In this section, we attempt to isolate the benefits from 
restricting consecutive row-buffer hitting requests vs. non-row-buffer hitting requests. To 
this end, we evaluate the performance and fairness benefits of a mechanism that places an 
application in the blacklist when a certain number of row-buffer hitting requests (N) to the 
same row have been served for an application (we call this FRFCFS-Cap-Blacklisting as the 
scheduler essentially is FRFCFS-Cap with blacklisting). We use an N value of 4 in our 
evaluations. 


Figure 3.13 compares the system performance and fairness of BLISS with 
FRFCFS-Cap-Blacklisting. We make three major observations. First, FRFCFS-Cap-Blacklisting has 
similar system performance as BLISS. On further analysis of individual workloads, we find that 
FRFCFS-Cap-Blacklisting blacklists only applications with high row-buffer locality, causing 
requests of non-blacklisted high-memory-intensity applications to interfere with requests of 
low-memory-intensity applications. However, the performance impact of this interference is 
offset by the performance improvement of high-memory-intensity applications that are not 
blacklisted. Second, FRFCFS-Cap-Blacklisting has higher unfairness (higher maximum slowdown and 
lower harmonic speedup) than BLISS. This is because the high-memory-intensity applications that 
are not blacklisted are prioritized over the blacklisted high-row-buffer-locality applications, 
thereby interfering with and slowing down the high-row-buffer-locality applications 
significantly. Third, FRFCFS-Cap-Blacklisting requires a per-bank counter to count and cap the 
number of row-buffer hits, whereas BLISS needs only one counter per channel to count the number 
of consecutive requests from the same application. Therefore, we conclude that BLISS is more 
effective in mitigating unfairness while incurring lower hardware cost than the 
FRFCFS-Cap-Blacklisting scheduler that we build by combining principles from FRFCFS-Cap and 
BLISS. 



Figure 3.13: Comparison with FRFCFS-Cap combined with blacklisting 


3.5.9 Comparison with Criticality-Aware Scheduling 

We compare the system performance and fairness of BLISS with those of criticality-aware memory 
schedulers [34]. The basic idea behind criticality-aware memory scheduling is to prioritize 
memory requests from load instructions that have stalled the instruction window for long 
periods of time in the past. Ghose et al. [34] evaluate prioritizing load requests based on 
both maximum stall time (Crit-MaxStall) and total stall time (Crit-TotalStall) caused by load 
instructions in the past. Figure 3.14 shows the system performance and fairness of BLISS and 
the criticality-aware scheduling mechanisms, normalized to FRFCFS, across 40 workloads. Two 
observations are in order. First, BLISS significantly outperforms criticality-aware scheduling 
mechanisms in terms of both system performance and fairness. This is because the 
criticality-aware scheduling mechanisms unfairly deprioritize and slow down 
low-memory-intensity applications that inherently generate fewer requests, since stall times 
tend to be low for such applications. Second, criticality-aware scheduling incurs hardware cost 
to prioritize requests with higher stall times. Specifically, the number of bits to represent 
stall times is on the order of 12-14, as described in [34]. Hence, the logic for comparing 
stall times and prioritizing requests with higher stall times would incur even higher cost than 
per-application ranking mechanisms where the number of bits to represent a core’s rank grows 
only as log2(NumberOfCores) (e.g., 5 bits for a 32-core system). Therefore, we conclude that 
BLISS achieves significantly better system performance and fairness, while incurring lower 
hardware cost. 
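
The rank width quoted above is simple arithmetic. A quick check, assuming ranks are encoded in 
the minimum number of bits needed to give every core a distinct value (the helper name is 
illustrative): 

```python
def rank_bits(num_cores):
    """Minimum number of bits that can hold num_cores distinct rank values."""
    return max(1, (num_cores - 1).bit_length())


for cores in (24, 32, 64):
    print(cores, "cores:", rank_bits(cores), "bits")
```

By comparison, the 12 to 14 bits cited for stall times make each comparator input more than 
twice as wide as a 5-bit rank. 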


Figure 3.14: Comparison with criticality-aware scheduling 
(schedulers plotted: FRFCFS, Crit-MaxStall, Crit-TotalStall, BLISS) 


3.5.10 Effect of Workload Memory Intensity and Row-buffer Locality 

In this section, we study the impact of workload memory intensity and row-buffer locality on 
performance and fairness of BLISS and five previous schedulers. 

Workload Memory Intensity. Figure 3.15 shows system performance and fairness for workloads 
with different memory intensities, classified into different categories based on the fraction 
of high-memory-intensity applications in a workload.^ We draw three major conclusions. First, 
BLISS outperforms previous memory schedulers in terms of system performance across all 
intensity categories. Second, the system performance benefits of BLISS increase with workload 
memory intensity. This is because as the number of high-memory-intensity applications in a 
workload increases, ranking individual applications, as done by previous schedulers, causes 
more unfairness and degrades system performance. Third, BLISS achieves significantly lower 
unfairness than previous memory schedulers, except FRFCFS-Cap and PARBS, across all intensity 
categories. Therefore, we conclude that BLISS is effective in mitigating interference and 
improving system performance and fairness across workloads with different compositions of 
high- and low-memory-intensity applications. 

^We classify applications with MPKI less than 5 as low-memory-intensity and the rest as high-memory-intensity. 

Figure 3.15: Sensitivity to workload memory intensity 


Workload Row-buffer Locality. Figure 3.16 shows the system performance and fairness of five 
previous schedulers and BLISS when the number of high row-buffer locality applications in a 
workload is varied.^ We draw three observations. First, BLISS achieves the best performance 
and close to the best fairness in most row-buffer locality categories. Second, BLISS’ 
performance and fairness benefits over baseline FRFCFS increase as the number of 
high-row-buffer-locality applications in a workload increases. As the number of 
high-row-buffer-locality applications in a workload increases, there is more interference to 
the low-row-buffer-locality applications that are vulnerable. Hence, there is more opportunity 
for BLISS to mitigate this interference and improve performance and fairness. Third, when all 
applications in a workload have high row-buffer locality (100%), the performance and fairness 
improvements of BLISS over baseline FRFCFS are a bit lower than the other categories. This is 
because, when all applications have high row-buffer locality, they each hog the row-buffer in 
turn and are not as susceptible to interference as the other categories in which there are 
vulnerable low-row-buffer-locality applications. However, the performance/fairness benefits of 
BLISS are still significant since BLISS is effective in regulating how the row-buffer is shared 
among different applications. Overall, we conclude that BLISS is effective in achieving high 
performance and fairness across workloads with different compositions of high- and 
low-row-buffer-locality applications. 

^We classify an application as having high row-buffer locality if its row-buffer hit rate is greater than 90%. 


Figure 3.16: Sensitivity to row-buffer locality 
(x-axis: % of High Row-buffer Locality Benchmarks in a Workload) 


3.5.11 Sensitivity to System Parameters 

Core and channel count. Figures 3.17 and 3.18 show the system performance and fairness of 
FRFCFS, PARBS, TCM and BLISS for different core counts (when the channel count is 4) and 
different channel counts (when the core count is 24), across 40 workloads for each core/channel 
count. The numbers over the bars indicate percentage increase or decrease compared to FRFCFS. 
We did not optimize the parameters of different schedulers for each configuration as this 
requires months of simulation time. We draw three major conclusions. First, the absolute values 
of weighted speedup increase with increasing core/channel count, whereas the absolute values of 
maximum slowdown increase/decrease with increasing core/channel count respectively, as 
expected. Second, BLISS achieves higher system performance and lower unfairness than all the 
other scheduling policies (except PARBS, in terms of fairness) similar to our results on the 
24-core, 4-channel system, by virtue of its effective interference mitigation. The only anomaly 
is that TCM has marginally higher weighted speedup than BLISS for the 64-core system. However, 
this increase comes at the cost of a significant increase in unfairness. Third, BLISS’ system 
performance benefit (as indicated by the percentages on top of bars, over FRFCFS) increases 
when the system becomes more bandwidth constrained, i.e., high core counts and low channel 
counts. As contention increases in the system, BLISS has greater opportunity to mitigate it.^ 

^Fairness benefits reduce at very high core counts and very low channel counts, since memory bandwidth becomes 
highly saturated. 

Figure 3.17: Sensitivity to number of cores 

Figure 3.18: Sensitivity to number of channels 

Cache size. Figure 3.19 shows the system performance and fairness for five previous schedulers 
and BLISS with different last level cache sizes (private to each core). 



Figure 3.19: Sensitivity to cache size 


We make two observations. First, the absolute values of weighted speedup increase and maximum 
slowdown decrease, as the cache size becomes larger for all schedulers, as expected. This is 
because contention for memory bandwidth reduces with increasing cache capacity, improving 
performance and fairness. Second, across all the cache capacity points we evaluate, BLISS 
achieves significant performance and fairness benefits over the best-performing previous 
schedulers, while approaching close to the fairness of the fairest previous schedulers. 

Figure 3.20: Performance and fairness with a shared cache 

Shared Caches. Figure 3.20 shows system performance and fairness with a 32 MB shared cache 
(instead of the 512 KB per core private caches used in our other experiments). BLISS achieves 
5%/24% better performance/fairness compared to TCM, demonstrating that BLISS is effective in 
mitigating memory interference in the presence of large shared caches as well. 


3.5.12 Sensitivity to Algorithm Parameters 


Tables 3.3 and 3.4 show the system performance and fairness respectively of BLISS for different 
values of the Blacklisting Threshold and Clearing Interval. Two major conclusions are in order. 
First, a Clearing Interval of 10000 cycles provides a good balance between performance and fair¬ 
ness. If the blacklist is cleared too frequently (1000 cycles), interference-causing applications are 
not deprioritized for long enough, resulting in low system performance. In contrast, if the blacklist 
is cleared too infrequently, interference-causing applications are deprioritized for too long, result¬ 
ing in high unfairness. Second, a Blacklisting Threshold of 4 provides a good balance between 
performance and fairness. When Blacklisting Threshold is very small, applications are blacklisted 
as soon as they have very few requests served, resulting in poor interference mitigation as too 
many applications are blacklisted. On the other hand, when Blacklisting Threshold is large, low- 
and high-memory-intensity applications are not segregated effectively, leading to high unfairness. 


Threshold \ Interval    1000    10000    100000 
2                       8.76    8.66     7.95 
4                       8.61    9.18     8.60 
8                       8.42    9.05     9.24 

Table 3.3: Performance sensitivity to threshold and interval 


Threshold \ Interval    1000    10000    100000 
2                       6.07    6.24     7.78 
4                       6.03    6.54     7.01 
8                       6.02    7.39     7.29 

Table 3.4: Unfairness sensitivity to threshold and interval 


3.5.13 Interleaving and Scheduling Interaction 


In this section, we study the impact of the address interleaving policy on the performance and fair¬ 
ness of different schedulers. Our analysis so far has assumed a row-interleaved policy, where data 
is distributed across channels, banks and rows at the granularity of a row. This policy optimizes 
for row-buffer locality by mapping a consecutive row of data to the same channel, bank, rank. In 
this section, we will consider two other interleaving policies, cache block interleaving and sub-row 
interleaving. 

Interaction with cache block interleaving. In a cache-block-interleaved system, data is striped 
across channels, banks and ranks at the granularity of a cache block. Such a policy optimizes for 
bank level parallelism, by distributing data at a small (cache block) granularity across channels, 
banks and ranks. 
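
To make the difference between interleaving policies concrete, the sketch below maps a cache 
block index to a channel and bank under a configurable interleaving granularity. The geometry 
(4 channels, 8 banks per channel, 128 cache blocks per row), the omission of ranks, and the 
function names are all assumptions chosen for illustration; they are not the address mapping 
of the evaluated system. 

```python
CHANNELS = 4          # assumed channel count
BANKS = 8             # assumed banks per channel (ranks omitted for brevity)
BLOCKS_PER_ROW = 128  # assumed row size in cache blocks


def map_block(block, granularity):
    """Map a cache block index to (channel, bank, row, column).

    `granularity` is the number of consecutive cache blocks kept in the same
    bank: BLOCKS_PER_ROW gives row interleaving, 1 gives cache block
    interleaving, and values in between give sub-row interleaving.
    """
    chunk, offset = divmod(block, granularity)
    channel = chunk % CHANNELS
    bank = (chunk // CHANNELS) % BANKS
    rest = chunk // (CHANNELS * BANKS)
    chunks_per_row = BLOCKS_PER_ROW // granularity
    row, chunk_in_row = divmod(rest, chunks_per_row)
    column = chunk_in_row * granularity + offset
    return channel, bank, row, column


def banks_touched(num_blocks, granularity):
    """Number of distinct (channel, bank) pairs hit by the first num_blocks blocks."""
    return len({map_block(b, granularity)[:2] for b in range(num_blocks)})


print(banks_touched(128, BLOCKS_PER_ROW))  # row interleaving
print(banks_touched(128, 1))               # cache block interleaving
```

With row granularity, 128 consecutive blocks stay in a single bank, which favors row-buffer 
hits. With cache block granularity, the same 128 blocks spread over all 32 channel-bank pairs, 
which favors bank-level parallelism. 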


Figure 3.21 shows the system performance and fairness of FRFCFS with row interleaving 
(FRFCFS-Row), as a comparison point, five previous schedulers, and BLISS with cache block 
interleaving. We draw three observations. First, system performance and fairness of the baseline 
FRFCFS scheduler improve significantly with cache block interleaving, compared to row 
interleaving. This is because cache block interleaving enables more requests to be served in 
parallel at the different channels and banks, by distributing data across channels and banks at 
the small granularity of a cache block. Hence, most applications, and particularly applications 
that do not have very high row-buffer locality, benefit from cache block interleaving. 


Second, as expected, application-aware schedulers such as ATLAS and TCM achieve the best 
performance among previous schedulers, by means of prioritizing requests of applications with 
low memory intensities. However, PARBS and FRFCFS-Cap do not improve fairness over the 
baseline, in contrast to our results with row interleaving. This is because cache block 
interleaving already attempts to provide fairness by increasing the parallelism in the system 
and enabling more requests from across different applications to be served in parallel, thereby 
reducing unfair application slowdowns. More specifically, requests that would be row-buffer 
hits to the same bank with row interleaving are now distributed across multiple channels and 
banks with cache block interleaving. Hence, applications’ propensity to cause interference 
reduces, providing lower scope for request capping based schedulers such as FRFCFS-Cap and 
PARBS to mitigate interference. Third, BLISS achieves within 1.3% of the performance of the 
best performing previous scheduler (ATLAS), while achieving 6.2% better fairness than the 
fairest previous scheduler (PARBS). BLISS effectively mitigates interference by regulating the 
number of consecutive requests served from high-memory-intensity applications that generate a 
large number of requests, thereby achieving high performance and fairness. 

Figure 3.21: Scheduling and cache block interleaving 

Interaction with sub-row interleaving. While memory scheduling has been a prevalent approach
to mitigate memory interference, previous work has also proposed other solutions, as we describe
elsewhere in this dissertation. One such previous work, by Kaseridis et al., proposes minimalist
open page, an interleaving policy that distributes data across channels, ranks and banks at the
granularity of a sub-row (partial row), rather than an entire row, exploiting both row-buffer
locality and bank-level parallelism. We examine BLISS’s interaction with such a sub-row
interleaving policy.
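
To make the three interleaving granularities concrete, the following Python sketch maps a byte
address to a (channel, bank, row) tuple by striping fixed-size chunks across channels and banks.
The function names, the geometry (2 channels, 8 banks, 64-byte cache blocks, 8 KB rows) and the
simple modulo mapping are illustrative assumptions, not the address mapping of our simulator;
ranks are omitted for brevity.

```python
CACHE_BLOCK = 64        # bytes per cache block (assumed)
ROW_SIZE = 8 * 1024     # bytes per DRAM row (assumed)


def map_address(addr, granularity, channels=2, banks=8):
    """Return (channel, bank, row) for a byte address.

    Consecutive chunks of `granularity` bytes are striped first across
    channels and then across banks (a simplified, hypothetical mapping).
    """
    chunk = addr // granularity              # index of the striping unit
    channel = chunk % channels
    bank = (chunk // channels) % banks
    # Chunks landing in the same (channel, bank) fill its rows in order.
    chunk_in_bank = chunk // (channels * banks)
    row = (chunk_in_bank * granularity) // ROW_SIZE
    return channel, bank, row


def banks_touched(granularity, num_blocks=128):
    """Distinct (channel, bank) pairs hit by a sequential block stream."""
    return len({map_address(i * CACHE_BLOCK, granularity)[:2]
                for i in range(num_blocks)})


row_interleaved = banks_touched(ROW_SIZE)              # one row per chunk
sub_row_interleaved = banks_touched(4 * CACHE_BLOCK)   # four blocks per chunk
block_interleaved = banks_touched(CACHE_BLOCK)         # one block per chunk
```

Under these assumptions, a sequential stream of 128 cache blocks stays within a single bank with
row interleaving, but spreads over all channel-bank pairs with the other two policies. With
sub-row interleaving, each group of four consecutive blocks still maps to the same bank and row,
which is how the policy retains some row-buffer locality.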


Figure 3.22 shows the system performance and fairness of FRFCFS with row interleaving
(FRFCFS-Row), FRFCFS with cache block interleaving (FRFCFS-Block), and five previously
proposed schedulers and BLISS with sub-row interleaving (where data is striped across channels,
ranks and banks at the granularity of four cache blocks).

Figure 3.22: Scheduling and sub-row interleaving


Three observations are in order. First, sub-row interleaving provides significant benefits over
row interleaving, as can be observed for FRFCFS (and for other scheduling policies, by comparing
with the corresponding row-interleaving results). This is because sub-row interleaving enables
applications to exploit both row-buffer locality and bank-level parallelism, unlike row
interleaving, which mainly focuses on exploiting row-buffer locality. Second, sub-row interleaving
achieves similar performance and fairness as cache block interleaving. We observe that this is
because cache block interleaving enables applications to exploit parallelism effectively, which
makes up for the row-buffer locality lost by distributing data at the granularity of a cache block
across all channels and banks. Third, BLISS achieves close to the performance (within 1.5%) of the
best performing previous scheduler (TCM), while reducing unfairness significantly and approaching
the fairness of the fairest previous schedulers. One thing to note is that BLISS has higher
unfairness than FRFCFS when a sub-row-interleaved policy is employed. This is because the capping
decisions from sub-row interleaving and BLISS could collectively restrict high-row-buffer-locality
applications to a large degree, thereby slowing them down and causing higher unfairness. Co-design
of the scheduling and interleaving policies to achieve different goals such as performance/fairness
is an important area of future research. We conclude that a BLISS-like scheduler, with its high
performance and low complexity, is a significantly better alternative to schedulers such as
ATLAS/TCM in the pursuit of such scheduling-interleaving policy co-design.

3.6 Summary


In summary, the Blacklisting memory scheduler (BLISS) is a new and simple approach to memory
scheduling in systems with multiple threads. We observe that the per-application ranking
mechanisms employed by previously proposed application-aware memory schedulers incur high
hardware cost, cause high unfairness, and lead to high scheduling latency to the point that the
scheduler cannot meet the fast command scheduling requirements of state-of-the-art DDR protocols.
BLISS overcomes these problems based on the key observation that it is sufficient to group
applications into only two groups, rather than employing a total rank order among all applications.
Our evaluations across a variety of workloads and systems demonstrate that BLISS has better system
performance and fairness than previously proposed ranking-based schedulers, while incurring
significantly lower hardware cost and latency in making scheduling decisions.

Chapter 4


Quantifying Application Slowdowns Due to 
Main Memory Interference 


In a multicore system, an application’s performance and slowdowns depend heavily on its co-running
applications and the amount of shared resource interference they cause, as we demonstrated
and discussed in earlier chapters. While the Blacklisting Scheduler (BLISS) is able to achieve high
system performance and fairness at low hardware complexity in the presence of main memory
interference, it does not have the ability to estimate and control application slowdowns.

The ability to accurately estimate application slowdowns can enable several use cases. For
instance, estimating the slowdown of each application may enable a cloud service provider
to estimate the performance provided to each application in the presence of consolidation on
shared hardware resources, thereby billing the users appropriately. Perhaps more importantly,
accurate slowdown estimates may enable allocation of shared resources to different applications in
a slowdown-aware manner, thereby satisfying different applications’ performance requirements.

Mechanisms and models to accurately estimate application slowdowns due to shared resource
interference have not been explored as much as shared resource interference mitigation techniques
have. Furthermore, the few previous works on slowdown estimation, STFM [86], FST [27] and
PTCA [25], are inaccurate, as we briefly discuss in Section 2.11. These works estimate slowdown
as the ratio of interfered to uninterfered stall/execution times. The uninterfered stall/execution
times are computed by estimating the number of cycles by which the interference experienced
by each individual request impacts execution time. Given the abundant parallelism available in
the memory subsystem, the service of different requests overlaps significantly. As a result,
accurately estimating the number of cycles by which each request is delayed due to interference is
inherently difficult, thereby resulting in high inaccuracies in the slowdown estimates.


We seek to accurately estimate application slowdowns due to memory bandwidth interference,
as a key step towards controlling application slowdowns. Towards this end, we first build the
Memory Interference induced Slowdown Estimation (MISE) model to accurately estimate application
slowdowns in the presence of memory bandwidth interference.


4.1 The MISE Model 


In this section, we provide a detailed description of our Memory Interference induced Slowdown
Estimation (MISE) model that estimates application slowdowns due to memory bandwidth
interference. For ease of understanding, we first describe the observations that lead to a simple
model for estimating the slowdown of a memory-bound application when it is run concurrently with
other applications (Section 4.1.1). In Section 4.1.2, we describe how we extend the model to
accommodate non-memory-bound applications. Section 4.2 describes the detailed implementation of our
model in a memory controller.


4.1.1 Memory-bound Application 

A memory-bound application is one that spends an overwhelmingly large fraction of its execution
time stalling on memory accesses. Therefore, the rate at which such an application’s requests
are served has significant impact on its performance. More specifically, we make the following
observation about a memory-bound application.

Observation 1: The performance of a memory-bound application is roughly proportional
to the rate at which its memory requests are served.

For instance, for an application that is bottlenecked at memory, if the rate at which its requests
are served is reduced by half, then the application will take twice as much time to finish the same
amount of work. To validate this observation, we conducted a real-system experiment where we
ran memory-bound applications from SPEC CPU2006 on a 4-core Intel Core i7 [40]. Each
SPEC application was run along with three copies of a microbenchmark whose memory intensity
can be varied.^1 By varying the memory intensity of the microbenchmark, we can change the rate
at which the requests of the SPEC application are served.

Figure 4.1 plots the results of this experiment for three memory-intensive SPEC benchmarks,
namely, mcf, omnetpp, and astar. The figure shows the performance of each application vs. the
rate at which its requests are served. The request service rate and performance are normalized to
the request service rate and performance, respectively, of each application when it is run alone on
the same system.



Figure 4.1: Request service rate vs. performance (x-axis: request service rate, normalized to
the request service rate when run alone)

^1 The microbenchmark streams through a large region of memory (one block at a time). The memory
intensity of the microbenchmark (LLC MPKI) is varied by changing the amount of computation
performed between memory operations.

The results of our experiments validate our observation. The performance of a memory-bound
application is directly proportional to the rate at which its requests are served. This suggests
that we can use the request-service-rate of an application as a proxy for its performance. More
specifically, we can compute the slowdown of an application, i.e., the ratio of its performance
when it is run alone on a system vs. its performance when it is run alongside other applications
on the same system, as follows:

\[ \text{Slowdown of an App.} = \frac{\text{alone-request-service-rate}}{\text{shared-request-service-rate}} \tag{4.1} \]


Estimating the shared-request-service-rate (SRSR) of an application is straightforward. It just
requires the memory controller to keep track of how many requests of the application are served
in a given number of cycles. However, the challenge is to estimate the alone-request-service-rate
(ARSR) of an application while it is run alongside other applications. A naive way of estimating
the ARSR of an application would be to prevent all other applications from accessing memory for a
length of time and measure the application’s ARSR. While this would provide an accurate estimate
of the application’s ARSR, this approach would significantly slow down other applications in the
system. Our second observation helps us to address this problem.


Observation 2: The ARSR of an application can be estimated by giving the requests 
of the application the highest priority in accessing memory. 


Giving an application’s requests the highest priority in accessing memory results in very little
interference from the requests of other applications. Therefore, many requests of the application
are served as if the application were the only one running on the system. Based on the above
observation, the ARSR of an application can be computed as follows:

\[ \text{ARSR of an App.} = \frac{\text{\# Requests with Highest Priority}}{\text{\# Cycles with Highest Priority}} \tag{4.2} \]

where # Requests with Highest Priority is the number of requests served when the application is
given highest priority, and # Cycles with Highest Priority is the number of cycles an application
is given highest priority by the memory controller.


The memory controller can use Equation 4.2 to periodically estimate the ARSR of an application
and Equation 4.1 to measure the slowdown of the application using the estimated ARSR.
Section 4.2 provides a detailed description of the implementation of our model inside a memory
controller.


4.1.2 Non-memory-bound Application 

So far, we have described our MISE model for a memory-bound application. We find that the
model presented above has low accuracy for non-memory-bound applications. This is because a
non-memory-bound application spends a significant fraction of its execution time in the compute
phase (when the core is not stalled waiting for memory). Hence, varying the request service rate
for such an application will not affect the length of the large compute phase. Therefore, we take
into account the duration of the compute phase to make the model accurate for non-memory-bound
applications.

Let α be the fraction of time spent by an application at memory. Therefore, the fraction of time
spent by the application in the compute phase is 1 - α. Since changing the request service rate
affects only the memory phase, we augment Equation 4.1 to take into account α as follows:

\[ \text{Slowdown of an App.} = (1 - \alpha) + \alpha \, \frac{\text{ARSR}}{\text{SRSR}} \tag{4.3} \]

In addition to estimating ARSR and SRSR required by Equation 4.1, the above equation requires
estimating the parameter α, the fraction of time spent in the memory phase. However, precisely
computing α for a modern out-of-order processor is a challenge since such a processor overlaps
computation with memory accesses. The processor stalls waiting for memory only when the oldest
instruction in the reorder buffer is waiting on a memory request. For this reason, we estimate α as
the fraction of time the processor spends stalling for memory.

\[ \alpha = \frac{\text{\# Cycles spent stalling on memory requests}}{\text{Total number of cycles}} \tag{4.4} \]

Setting α to 1 reduces Equation 4.3 to Equation 4.1. We find that even when an application is
moderately memory-intensive, setting α to 1 provides a better estimate of slowdown. Therefore,
our final model for estimating slowdown takes into account the stall fraction (α) only when it is
low. Algorithm 1 shows our final slowdown estimation model.


Compute α;
if α < Threshold then
    Slowdown = (1 - α) + α (ARSR / SRSR)
else
    Slowdown = ARSR / SRSR
end

Algorithm 1: The MISE model
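
As a minimal sketch, Algorithm 1 can be written as a single Python function. The function name
and the default threshold of 0.5 are hypothetical choices made only for illustration; the text
above does not fix a threshold value.

```python
def mise_slowdown(arsr, srsr, alpha, threshold=0.5):
    """Estimate an application's slowdown following Algorithm 1.

    arsr, srsr: alone and shared request service rates (same units).
    alpha: fraction of cycles the core spends stalled on memory.
    threshold: hypothetical cutoff below which alpha is taken into account.
    """
    if srsr <= 0:
        raise ValueError("shared request service rate must be positive")
    ratio = arsr / srsr
    if alpha < threshold:
        # Non-memory-bound: only the memory phase stretches (Equation 4.3).
        return (1 - alpha) + alpha * ratio
    # Memory-bound: treat alpha as 1, which reduces to Equation 4.1.
    return ratio
```

For example, with an ARSR that is twice the SRSR, an application with α = 0.25 is estimated to
slow down by 1.25x, whereas a memory-bound application (α above the threshold) is estimated to
slow down by 2x.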


4.2 Implementation 

In this section, we describe a detailed implementation of our MISE model in a memory controller.
For each application in the system, our model requires the memory controller to compute three
parameters: 1) shared-request-service-rate (SRSR), 2) alone-request-service-rate (ARSR), and 3) α
(stall fraction).^2 First, we describe the scheduling algorithm employed by the memory controller.
Then, we describe how the memory controller computes each of the three parameters.

4.2.1 Memory Scheduling Algorithm 

In order to implement our model, each application needs to be given the highest priority
periodically, such that its alone-request-service-rate can be measured. This can be achieved by
simply assigning each application’s requests highest priority in a round-robin manner. However, the
mechanisms we build on top of our model allocate bandwidth to different applications to achieve
QoS/fairness. Therefore, in order to facilitate the implementation of our mechanisms, we employ
a lottery-scheduling-like approach [93, 171] to schedule requests in the memory controller. The
basic idea of lottery scheduling is to probabilistically enforce a given bandwidth allocation, where
each application is allocated a certain share of the bandwidth. The exact bandwidth allocation
policy depends on the goal of the system - e.g., QoS, high performance, high fairness, etc. In
this section, we describe how a lottery-scheduling-like algorithm works to enforce a bandwidth
allocation.

^2 These three parameters need to be computed only for the active applications in the system.
Hence, these need to be tracked only per hardware thread context.

The memory controller divides execution time into intervals (of M processor cycles each).
Each interval is further divided into small epochs (of N processor cycles each). At the beginning
of each interval, the memory controller estimates the slowdown of each application in the system.
Based on the slowdown estimates and the final goal, the controller may change the bandwidth
allocation policy - i.e., redistribute bandwidth amongst the concurrently running applications. At
the beginning of each epoch, the memory controller probabilistically picks a single application and
prioritizes all the requests of that particular application during that epoch. The probability
distribution used to choose the prioritized application is such that an application with higher
bandwidth allocation has a higher probability of getting the highest priority. For example,
consider a system with two applications, A and B. If the memory controller allocates A 75% of the
memory bandwidth and B the remaining 25%, then A and B get the highest priority with probability
0.75 and 0.25, respectively.
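
The per-epoch selection described above can be sketched in Python as a weighted random draw. The
function names and the use of Python's `random.choices` are our own illustrative choices; a
hardware memory controller would implement the draw differently.

```python
import random


def pick_prioritized_app(allocation, rng=random):
    """Pick the application that gets highest priority for one epoch.

    allocation: dict mapping application name -> bandwidth share (weight).
    rng: source of randomness exposing a `choices` method.
    """
    apps = list(allocation)
    weights = [allocation[app] for app in apps]
    return rng.choices(apps, weights=weights, k=1)[0]


def simulate_interval(allocation, epochs_per_interval, seed=0):
    """Count the highest-priority epochs each application wins in an interval."""
    rng = random.Random(seed)
    counts = {app: 0 for app in allocation}
    for _ in range(epochs_per_interval):
        counts[pick_prioritized_app(allocation, rng)] += 1
    return counts


# The example from the text: A is allocated 75% of the bandwidth, B 25%.
counts = simulate_interval({"A": 0.75, "B": 0.25}, epochs_per_interval=500)
```

Over many epochs, the fraction of epochs in which each application holds the highest priority
approaches its allocated share, which is how the allocation is enforced probabilistically rather
than exactly.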

4.2.2 Computing shared-request-service-rate (SRSR) 

The shared-request-service-rate of an application is the rate at which the application’s requests
are served while it is running with other applications. This can be directly computed by the memory
controller using a per-application counter that keeps track of the number of requests served for
that application. At the beginning of each interval, the controller resets the counter for each
application. Whenever a request of an application is served, the controller increments the counter
corresponding to that application. At the end of each interval, the SRSR of an application is
computed as

\[ \text{SRSR of an App.} = \frac{\text{\# Requests served}}{M \ (\text{Interval Length})} \]

4.2.3 Computing alone-request-service-rate (ARSR) 


The alone-request-service-rate (ARSR) of an application is an estimate of the rate at which the
application’s requests would have been served had it been running alone on the same system. Based
on our observation (described in Section 4.1.1), the ARSR can be estimated by using the
request-service-rate of the application when its requests have the highest priority in accessing
memory. Therefore, the memory controller estimates the ARSR of an application only during the
epochs in which the application has the highest priority.


Ideally, the memory controller should be able to achieve this using two counters: one to keep
track of the number of epochs during which the application received highest priority and another
to keep track of the number of requests of the application served during its highest-priority
epochs. However, it is possible that even when an application’s requests are given highest
priority, they may receive interference from other applications’ requests. This is because our
memory scheduling is work-conserving: if there are no requests from the highest-priority
application, it schedules a ready request from some other application. Once a request is scheduled,
it cannot be preempted because of the way DRAM operates.

In order to account for this interference, the memory controller uses a third counter for each
application to track the number of cycles during which an application’s request was blocked due to
some other application’s request, in spite of the former having highest priority. For an
application with highest priority, a cycle is deemed to be an interference cycle if, during that
cycle, a command corresponding to a request of that application is waiting in the request buffer
and the previous command issued to any bank was for a request from a different application.

Based on the above discussion, the memory controller keeps track of three counters to compute
the ARSR of an application: 1) number of highest-priority epochs of the application (# HPEs),
2) number of requests of that application served during its highest-priority epochs (# HPE
Requests), and 3) number of interference cycles of the application during its highest-priority
epochs (# Interference cycles). All these counters are reset at the start of an interval and the
ARSR is computed at the end of each interval as follows:

\[ \text{ARSR of an App.} = \frac{\text{\# HPE Requests}}{N \cdot (\text{\# HPEs}) - (\text{\# Interference cycles})} \]

where N is the epoch length in cycles.

Our model does not take into account bank-level parallelism (BLP) or row-buffer interference
when estimating # Interference cycles. We observe that this does not affect the accuracy of our
model significantly, because we eliminate most of the interference by measuring ARSR only when
an application has highest priority. We leave a study of the effects of bank-level parallelism and
row-buffer interference on the accuracy of our model as part of future work.
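
The counter bookkeeping of Sections 4.2.2 and 4.2.3 can be summarized with the following Python
sketch. The class and field names are hypothetical, and returning `None` when an application
received no usable highest-priority cycles in an interval is our own assumption about how such a
case might be handled.

```python
from dataclasses import dataclass


@dataclass
class AppCounters:
    """Per-application counters, reset at the start of every interval."""
    requests_served: int = 0       # all requests served (used for SRSR)
    hpes: int = 0                  # number of highest-priority epochs
    hpe_requests: int = 0          # requests served during those epochs
    interference_cycles: int = 0   # cycles blocked despite highest priority

    def srsr(self, interval_length):
        """Shared request service rate over one interval."""
        return self.requests_served / interval_length

    def arsr(self, epoch_length):
        """Alone request service rate, or None if it cannot be estimated."""
        cycles = epoch_length * self.hpes - self.interference_cycles
        if cycles <= 0:
            return None  # assumption: no estimate without usable cycles
        return self.hpe_requests / cycles


# Illustrative (made-up) counter values for one 5-million-cycle interval
# with 10000-cycle epochs.
c = AppCounters(requests_served=50_000, hpes=125,
                hpe_requests=25_000, interference_cycles=250_000)
estimated_slowdown = c.arsr(10_000) / c.srsr(5_000_000)
```

With these made-up values, the application is served 0.01 requests per cycle overall and 0.025
requests per cycle during its interference-free highest-priority cycles, giving an estimated
slowdown of 2.5 for a memory-bound application.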

4.2.4 Computing stall-fraction α

The stall-fraction (α) is the fraction of the cycles spent by the application stalling for memory
requests. The number of stall cycles can be easily computed by the core and communicated to the
memory controller at the end of each interval.

4.2.5 Hardware Cost 

Our implementation incurs additional storage cost due to 1) the counters that keep track of
parameters required to compute slowdown (five per hardware thread context), and 2) a register that
keeps track of the current bandwidth allocation policy (one per hardware thread context). We find
that using four-byte registers for each counter is more than sufficient for the values they keep
track of. Therefore, our model incurs a storage cost of at most 24 bytes (six four-byte registers)
per hardware thread context.


4.3 Methodology 


Simulation Setup. We model the memory system using an in-house cycle-accurate DDR3-SDRAM
simulator. We have integrated this DDR3 simulator into an in-house cycle-level x86 simulator with
a Pin frontend, which models out-of-order cores with a limited-size instruction window. Each
core has a 512 KB private cache. We model main memory as the only shared resource, in order
to isolate and analyze the effect of memory interference on application slowdowns. Table 4.1
provides more details of the simulated systems.

Unless otherwise specified, the evaluated systems consist of 4 cores and a memory subsystem
with 1 channel, 1 rank/channel and 8 banks/rank. We use row interleaving to map the physical
address space onto DRAM channels, ranks and banks. Data is striped across different channels,
ranks and banks, at the granularity of a row. Our workloads are made up of 26 benchmarks from
the SPEC CPU2006 suite.

Workloads. We form multiprogrammed workloads using combinations of these 26 benchmarks.
We extract a representative phase of each benchmark using PinPoints and run that phase for
200 million cycles. We will provide more details about our workloads as and when required.


Processor            4-16 cores, 5.3GHz, 3-wide issue,
                     8 MSHRs, 128-entry instruction window
Last-level cache     64B cache-line, 16-way associative,
                     512KB private cache-slice per core
Memory controller    64/64-entry read/write request queues per controller
Memory               Timing: DDR3-1066 (8-8-8)
                     Organization: 1 channel, 1 rank-per-channel,
                     8 banks-per-rank, 8 KB row-buffer

Table 4.1: Configuration of the simulated system


Metrics. We use average error to compare the accuracy of MISE and previously proposed models.
We compute slowdown estimation error for each application, at the end of every quantum (Q), as
the absolute value of the relative difference between the estimated and actual slowdowns:

\[ \text{Error} = \left| \frac{\text{Estimated Slowdown} - \text{Actual Slowdown}}{\text{Actual Slowdown}} \right| \times 100\% \]

\[ \text{Actual Slowdown} = \frac{IPC_{alone}}{IPC_{shared}} \]

We compute IPC_alone for the same amount of work as the shared run for each quantum. For each
application, we compute the average slowdown estimation error across all quanta in a workload
run and then compute the average across all occurrences of the application in all of our workloads.
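
As an illustration of this metric, the following Python sketch computes the per-quantum error and
its average over a run. The function names are hypothetical and the inputs are plain lists of
per-quantum values; the sketch does not model how IPC values are collected.

```python
def actual_slowdown(ipc_alone, ipc_shared):
    """Actual slowdown: alone IPC over shared IPC for the same work."""
    return ipc_alone / ipc_shared


def slowdown_error(estimated, actual):
    """Absolute slowdown estimation error for one quantum, in percent."""
    return abs(estimated - actual) / actual * 100.0


def average_error(estimates, actuals):
    """Mean error across the quanta of one application's workload run."""
    errors = [slowdown_error(e, a) for e, a in zip(estimates, actuals)]
    return sum(errors) / len(errors)
```

For instance, an estimate of 2.2 against an actual slowdown of 2.0 yields an error of 10%, and so
does an estimate of 1.8, since the metric uses the absolute value of the difference.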

Parameters. We use an interval length (M) of 5 million cycles and an epoch length (N) of 10000
cycles for all our evaluations. Section 4.5 evaluates the sensitivity of our model to these
parameters.


4.4 Comparison to STFM 

Stall-Time-Fair Memory scheduling (STFM) [86] is one of the few previous works that attempt to
estimate main-memory-induced slowdowns of individual applications when they are run concurrently
on a multicore system. As we described in Section 2.11 and earlier in this chapter, STFM
estimates the slowdown of an application by estimating the number of cycles it stalls due to
interference from other applications’ requests. Other previous works on slowdown estimation
[27, 25] estimate slowdown due to memory bandwidth interference in a similar manner as STFM and,
in addition, estimate slowdown due to shared cache interference as well. Since MISE’s focus is
on slowdown estimation in the presence of memory bandwidth interference, we qualitatively and
quantitatively compare MISE with STFM.

There are two key differences between MISE and STFM for estimating slowdown. First, MISE
uses request service rates rather than stall times to estimate slowdown. As we mentioned in
Section 4.1, the alone-request-service-rate of an application can be fairly accurately estimated by
giving the application highest priority in accessing memory. Giving the application highest
priority in accessing memory results in very little interference from other applications. In
contrast, STFM attempts to estimate the alone-stall-time of an application while it is receiving
significant interference from other applications. Second, MISE takes into account the effect of the
compute phase for non-memory-bound applications. STFM, on the other hand, has no such provision to
account for the compute phase. As a result, MISE’s slowdown estimates for non-memory-bound
applications are significantly more accurate than STFM’s estimates.

Figure 4.2 compares the accuracy of the MISE model with STFM for six representative
memory-bound applications from the SPEC CPU2006 benchmark suite. Each application is run on a
4-core system along with three other applications: sphinx3, leslie3d, and milc. The figure plots
three curves: 1) actual slowdown, 2) slowdown estimated by STFM, and 3) slowdown estimated by
MISE. For most applications in the SPEC CPU2006 suite, the slowdown estimated by MISE is
significantly more accurate than STFM’s slowdown estimates. All applications whose slowdowns
are shown in Figure 4.2, except sphinx3, are representative of this behavior. For a few
applications and workload combinations, STFM’s estimates are comparable to the slowdown estimates
from our model: sphinx3 is an example of such an application. However, as we will show below,
across all workloads, the MISE model provides lower average slowdown estimation error for all
applications.

Figure 4.3 compares the accuracy of MISE with STFM for three representative non-memory-bound
applications, when each application is run on a 4-core system along with three other
applications: sphinx3, leslie3d, and milc. As shown in the figure, MISE’s estimates are
significantly more accurate than STFM’s estimates. As mentioned before, STFM does not account for
the compute phase of these applications. However, these applications spend a significant amount of
their execution time in the compute phase. This is the reason why our model, which takes into
account the effect of the compute phase of these applications, is able to provide more accurate
slowdown estimates for non-memory-bound applications.

Figure 4.2: Comparison of our MISE model with STFM for representative memory-bound applications.
Panels: (a) lbm, (b) leslie3d, (c) sphinx3, (d) GemsFDTD, (e) soplex, (f) cactusADM.

Figure 4.3: Comparison of our MISE model with STFM for representative non-memory-bound
applications. Panels: (a) wrf, (b) povray, (c) calculix. Curves: Actual, STFM, MISE; x-axis:
Million Cycles (0 to 200).

Table 4.2 shows the average slowdown estimation error for each benchmark, with STFM and
MISE, across all 300 4-core workloads of different memory intensities.^3 As can be observed,
MISE’s slowdown estimates have significantly lower error than STFM’s slowdown estimates across
most benchmarks. Across 300 workloads, STFM’s estimates deviate from the actual slowdown by
29.8%, whereas our proposed MISE model’s estimates deviate from the actual slowdown by only
8.1%. Therefore, we conclude that our slowdown estimation model provides better accuracy than
STFM.

^3 See Table 5.1 and Section 5.1.3 for more details about these 300 workloads.

Benchmark        STFM    MISE      Benchmark        STFM    MISE
453.povray       56.3    0.1       473.astar        12.3    8.1
454.calculix     43.5    1.3       456.hmmer        17.9    8.1
400.perlbench    26.8    1.6       464.h264ref      13.7    8.3
447.dealII       37.5    2.4       401.bzip2        28.3    8.5
436.cactusADM    18.4    2.6       458.sjeng        21.3    8.8
450.soplex       29.8    3.5       433.milc         26.4    9.5
444.namd         43.6    3.7       481.wrf          33.6    11.1
437.leslie3d     26.4    4.3       429.mcf          83.74   11.5
403.gcc          25.4    4.5       445.gobmk        23.1    12.5
462.libquantum   48.9    5.3       483.xalancbmk    18.0    13.6
459.GemsFDTD     21.6    5.5       435.gromacs      31.4    15.6
470.lbm          6.9     6.3       482.sphinx3      21      16.8
473.astar        12.3    8.1       471.omnetpp      26.2    17.5
456.hmmer        17.9    8.1       465.tonto        32.7    19.5

Table 4.2: Average error for each benchmark (in %)

4.5 Sensitivity to Algorithm Parameters 

We evaluate the sensitivity of the MISE model to epoch and interval lengths. Table 4.3 presents
the average error (in %) of the MISE model for different values of epoch and interval lengths. Two
major conclusions are in order. First, when the interval length is small (1 million cycles), the
error is very high. This is because the request service rate is not stable at such small interval
lengths and varies significantly across intervals. Therefore, it cannot serve as an effective proxy
for performance. On the other hand, when the interval length is larger, request service rate
exhibits a more stable behavior and can serve as an effective measure of application slowdowns.
Therefore, we conclude that except at very low interval lengths, the MISE model is robust. Second,
the average error is high for high epoch lengths (1 million cycles) because the number of epochs in
an interval reduces. As a result, some applications might not be assigned highest priority for any
epoch during an interval, preventing estimation of their alone-request-service-rate. Note that the
effect of this is mitigated as the interval length increases, as with a larger interval length the
number of epochs in an interval increases. For smaller epoch length values, however, the average
error of MISE does not exhibit much variation and is robust. The lowest average error of 8.1% is
achieved at an interval length of 5 million cycles and an epoch length of 10000 cycles.
Furthermore, we observe that estimating slowdowns at an interval length of 5 million cycles also
enables enforcing QoS at fine time granularities, although higher interval lengths exhibit similar
average error. Therefore, we use these values of interval and epoch lengths for our evaluations.


                          Interval Length
Epoch Length    1 mil.    5 mil.    10 mil.   25 mil.   50 mil.
1000            65.1%     9.1%      11.5%     10.7%     8.2%
10000           64.1%     8.1%      9.6%      8.6%      8.5%
100000          64.3%     11.2%     9.1%      8.9%      9%
1000000         64.5%     31.3%     14.8%     14.9%     11.7%

Table 4.3: Sensitivity of average error to epoch and interval lengths 
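
The effect of a long epoch on estimation can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that one application is picked for highest priority in each epoch with probability equal to its bandwidth share, independently across epochs, and the function names are hypothetical rather than part of the MISE design.

```python
def epochs_per_interval(interval_cycles, epoch_cycles):
    """Number of whole epochs that fit in one interval."""
    return interval_cycles // epoch_cycles


def prob_never_prioritized(share, num_epochs):
    """Probability that an application with the given bandwidth share
    (a fraction in [0, 1]) is never given highest priority in an interval,
    assuming independent per-epoch draws (an assumption of this sketch)."""
    return (1.0 - share) ** num_epochs
```

Under these assumptions, a 5-million-cycle interval holds only 5 epochs of 1 million cycles, so an application with a 25% share misses every epoch with probability 0.75^5, roughly 0.24. With 10000-cycle epochs the same interval holds 500 epochs and that probability becomes negligible.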


4.6 Summary 

In summary, we propose MISE, a new and simple model to estimate application slowdowns due 
to inter-application interference in main memory. MISE is based on two simple observations: 
1) the rate at which an application’s memory requests are served can be used as a proxy for the 
application’s performance, and 2) the uninterfered request-service-rate of an application can be 
accurately estimated by giving the application’s requests the highest priority in accessing main 
memory. Compared to state-of-the-art approaches for estimating main memory slowdowns, MISE 
is simpler and more accurate, as our evaluations show. 
Chapter 5


Applications of the MISE Model 


Accurate slowdown estimates from the MISE model can be leveraged in multiple possible ways. 
On the one hand, they can be leveraged in hardware, to perform allocation of memory bandwidth to 
different applications, such that the overall system performance/fairness is improved or different 
applications’ performance guarantees are met. On the other hand, MISE’s slowdown estimates 
can be communicated to the system software/hypervisor, enabling virtual machine migration and 
admission control schemes. 

We propose and evaluate two such use cases of MISE: 1) a mechanism to provide soft QoS
guarantees (MISE-QoS) and 2) a mechanism that attempts to minimize maximum slowdown to
improve overall system fairness (MISE-Fair).


5.1 MISE-QoS: Providing Soft QoS Guarantees 

MISE-QoS is a mechanism to provide soft QoS guarantees to one or more applications of interest
in a workload with many applications, while trying to maximize overall performance for the
remaining applications. By soft QoS guarantee, we mean that the applications of interest (AoIs)
should not be slowed down by more than an operating-system-specified bound. One way of achieving
such a soft QoS guarantee is to always prioritize the AoIs. However, such a mechanism has two
shortcomings. First, it works only when there is one AoI. With more than one AoI, prioritizing
all AoIs will cause them to interfere with each other, making their slowdowns uncontrollable.
Second, even with just one AoI, a mechanism that always prioritizes the AoI may unnecessarily
slow down other applications in the system. MISE-QoS addresses these shortcomings by using
slowdown estimates of the AoIs to allocate them just enough memory bandwidth to meet their
specified slowdown bound. We present the operation of MISE-QoS with one AoI and then describe
how it can be extended to multiple AoIs.


5.1.1 Mechanism Description 


The operation of MISE-QoS with one AoI is simple. As we describe in Section 4.2.1, the memory
controller divides execution time into intervals of length M. The controller maintains the current
bandwidth allocation for the AoI. At the end of each interval, it estimates the slowdown of the
AoI and compares it with the specified bound, say B. If the estimated slowdown is less than
B, then the controller reduces the bandwidth allocation for the AoI by a small amount (2% in
our experiments). On the other hand, if the estimated slowdown is more than B, the controller
increases the bandwidth allocation for the AoI (by 2%).^1 The remaining bandwidth is used by all
other applications in the system in a free-for-all manner. The above mechanism attempts to ensure
that the AoI gets just enough bandwidth to meet its target slowdown bound. As a result, the other
applications in the system are not unnecessarily slowed down.
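
The per-interval adjustment described above can be sketched as follows. This is a minimal illustration, not the hardware design: the function names are hypothetical, shares are kept as integer percentage points, and clamping the share to the range 0 to 100 is an assumption that the text does not spell out.

```python
def adjust_aoi_allocation(allocation_pct, estimated_slowdown, bound, step_pct=2):
    """Return the bandwidth share (in percent) of the application of interest
    for the next interval. step_pct is the 2% increment used in the text."""
    if estimated_slowdown < bound:
        allocation_pct -= step_pct   # bound met with room to spare: release bandwidth
    elif estimated_slowdown > bound:
        allocation_pct += step_pct   # bound violated: give the application more bandwidth
    # Assumption of this sketch: the share is clamped to [0, 100].
    return max(0, min(100, allocation_pct))


def bound_unreachable(allocation_pct, estimated_slowdown, bound):
    """True when the application already holds all the bandwidth and still misses the bound."""
    return allocation_pct >= 100 and estimated_slowdown > bound
```

For example, a share of 50% with an estimated slowdown of 1.5 against a bound of 2.0 drops to 48% for the next interval, leaving the released 2% to the other applications.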

In some cases, it is possible that the target bound cannot be met even by allocating all the memory
bandwidth to the AoI, i.e., prioritizing its requests 100% of the time. This is because even
the application with the highest priority (AoI) could be subject to interference, slowing it down by
some factor, as we describe in Section 4.2.3. Therefore, in scenarios when it is not possible to meet
the target bound for the AoI, the memory controller can convey this information to the operating
system, which can then take appropriate action (e.g., deschedule some other applications from the
machine).

^1 We found that 2% increments in memory bandwidth work well empirically, as our results indicate. Better
techniques that dynamically adapt the increment are possible and are a part of our future work.

5.1.2 MISE-QoS with Multiple AoIs

The above described MISE-QoS mechanism can be easily extended to a system with multiple
AoIs. In such a system, the memory controller maintains the bandwidth allocation for each AoI.
At the end of each interval, the controller checks if the slowdown estimate for each AoI meets the
corresponding target bound. Based on the result, the controller either increases or decreases the
bandwidth allocation for each AoI (similar to the mechanism in Section 5.1.1).

With multiple AoIs, it may not be possible to meet the specified slowdown bound for any of
the AoIs. Our mechanism concludes that the specified slowdown bounds cannot be met if: 1) all
the available bandwidth is partitioned only between the AoIs, i.e., no bandwidth is allocated to
the other applications, and 2) any of the AoIs does not meet its slowdown bound after R intervals
(where R is empirically determined at design time). Similar to the scenario with one AoI, the
memory controller can convey this conclusion to the operating system (along with the estimated
slowdowns), which can then take an appropriate action. Note that other potential mechanisms for
determining whether slowdown bounds can be met are possible.
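
The two-part infeasibility test above can be sketched as below. The text leaves the exact bookkeeping open, so this sketch adopts one possible reading: condition 2 holds when at least one application of interest missed its bound in each of the last R intervals. The function name and data layout are hypothetical.

```python
def bounds_unachievable(aoi_shares_pct, missed_per_interval, r):
    """aoi_shares_pct: dict mapping each application of interest to its share in percent.
    missed_per_interval: list, oldest first, of sets naming the applications of
    interest that missed their bound in each interval.
    r: number of intervals to wait (R in the text), at least 1."""
    if r < 1:
        raise ValueError("r must be at least 1")
    # Condition 1: no bandwidth is left for the other applications.
    no_bandwidth_left = sum(aoi_shares_pct.values()) >= 100
    # Condition 2 (one reading): some application missed its bound in each of the last r intervals.
    recent = missed_per_interval[-r:]
    still_missing = len(recent) == r and all(recent)
    return no_bandwidth_left and still_missing
```

When this returns True, the controller would report the estimated slowdowns to the operating system instead of continuing to shift bandwidth.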

5.1.3 Evaluation with Single AoI

To evaluate MISE-QoS with a single AoI, we run each benchmark as the AoI, alongside 12 different
workload mixes shown in Table 5.1. We run each workload with 10 different slowdown
bounds for the AoI: 10/1, 10/2, ..., 10/10. These slowdown bounds are chosen so as to have more data
points between the bounds of 1× and 5×.^2 In all, we present results for 3000 data points with
different workloads and slowdown bounds. We compare MISE-QoS with a mechanism that always
prioritizes the AoI (AlwaysPrioritize).

^2 Most applications are not slowed down by more than 5× for our system configuration.
Mix No.   Benchmark 1   Benchmark 2   Benchmark 3
1         sphinx3       leslie3d      milc
2         sjeng         gcc           perlbench
3         tonto         povray        wrf
4         perlbench     gcc           povray
5         gcc           povray        leslie3d
6         perlbench     namd          lbm
7         mcf           bzip2         libquantum
8         hmmer         lbm           omnetpp
9         sjeng         libquantum    cactusADM
10        namd          libquantum    mcf
11        xalancbmk     mcf           astar
12        mcf           libquantum    leslie3d

Table 5.1: Workload mixes 


Table 5.2 shows the effectiveness of MISE-QoS in meeting the prescribed slowdown bounds for
the 3000 data points. As shown, for approximately 79% of the workloads, MISE-QoS meets the
specified bound and correctly estimates that the bound is met. However, for 2.1% of the workloads,
MISE-QoS does meet the specified bound but it incorrectly estimates that the bound is not met.
This is because, in some cases, MISE-QoS slightly overestimates the slowdown of applications.
Overall, MISE-QoS meets the specified slowdown bound for close to 80.9% of the workloads,
as compared to AlwaysPrioritize that meets the bound for 83% of the workloads. Therefore, we
conclude that MISE-QoS meets the bound for 97.5% of the workloads where AlwaysPrioritize
meets the bound. Furthermore, MISE-QoS correctly estimates whether or not the bound was met
for 95.7% of the workloads, whereas AlwaysPrioritize has no provision to estimate whether or not
the bound was met.


Scenario                             # Workloads   % Workloads
Bound Met and Predicted Right        2364          78.8%
Bound Met and Predicted Wrong        65            2.1%
Bound Not Met and Predicted Right    509           16.9%
Bound Not Met and Predicted Wrong    62            2.2%


Table 5.2: Effectiveness of MISE-QoS 
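
The aggregate figures quoted in the text follow directly from the table. The short check below recomputes them from the reported percentages; the variable names are illustrative, and the 83% figure for AlwaysPrioritize is taken from the text rather than from the table.

```python
# Percentages as reported in Table 5.2.
met_right, met_wrong = 78.8, 2.1
not_met_right, not_met_wrong = 16.9, 2.2
always_prioritize_met = 83.0  # reported in the text for AlwaysPrioritize

# Workloads for which MISE-QoS meets the bound, whatever it predicted.
bound_met = round(met_right + met_wrong, 1)                               # 80.9
# The same quantity relative to what AlwaysPrioritize achieves.
relative_to_always = round(100 * bound_met / always_prioritize_met, 1)    # 97.5
# Workloads for which the prediction (met or not met) is correct.
predicted_right = round(met_right + not_met_right, 1)                     # 95.7
```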


To show the effectiveness of MISE-QoS, we compare the AoI’s slowdown due to MISE-QoS
and the mechanism that always prioritizes the AoI (AlwaysPrioritize) [44]. Figure 5.1 presents
representative results for 8 different AoIs when they are run alongside Mix 1 (Table 5.1). The
label MISE-QoS-n corresponds to a slowdown bound of 10/n. (Note that AlwaysPrioritize does not
take into account the slowdown bound.) Note that the slowdown bound decreases (i.e., becomes
tighter) from left to right for each benchmark in Figure 5.1 (as well as in other figures). We draw
three conclusions from the results.


[Figure 5.1 legend: AlwaysPrioritize, MISE-QoS-1 through MISE-QoS-10]

Figure 5.1: AoI performance: MISE-QoS vs. AlwaysPrioritize


First, for most applications, the slowdown of AlwaysPrioritize is considerably more than one.
As described in Section 5.1.1, always prioritizing the AoI does not completely prevent other
applications from interfering with the AoI.

Second, as the slowdown bound for the AoI is decreased (left to right), MISE-QoS gradually
increases the bandwidth allocation for the AoI, eventually allocating all the available bandwidth to
the AoI. At this point, MISE-QoS performs very similarly to the AlwaysPrioritize mechanism.


Third, in almost all cases (in this figure and across all our 3000 data points), MISE-QoS meets
the specified slowdown bound if AlwaysPrioritize is able to meet the bound. One exception to
this is benchmark gromacs. For this benchmark, MISE-QoS meets the slowdown bound for a range
of bound values. For two other bound values, MISE-QoS does not meet the
bound even though allocating all the bandwidth to gromacs would have achieved these slowdown
bounds (since AlwaysPrioritize can meet the slowdown bound for these values). This is because
our MISE model underestimates the slowdown for gromacs. Therefore, MISE-QoS incorrectly
assumes that the slowdown bound is met for gromacs.

Overall, MISE-QoS accurately estimates the slowdown of the AoI and allocates just enough
bandwidth to the AoI to meet a slowdown bound. As a result, MISE-QoS is able to significantly
improve the performance of the other applications in the system (as we show next).


System Performance and Fairness. Figure 5.2 compares the system performance (harmonic
speedup) and fairness (maximum slowdown) of MISE-QoS and AlwaysPrioritize for different values
of the bound. We omit the AoI from the performance and fairness calculations. The results are
categorized into four workload categories (0, 1, 2, 3) indicating the number of memory-intensive
benchmarks in the workload. For clarity, the figure shows results only for a few slowdown bounds.
Three conclusions are in order.


[Figure 5.2 legend: AlwaysPrioritize, MISE-QoS-1, MISE-QoS-3, MISE-QoS-5, MISE-QoS-7, MISE-QoS-9; x-axis: Number of Memory Intensive Benchmarks in a Workload]

Figure 5.2: Average system performance and fairness across 300 workloads of different memory
intensities


First, MISE-QoS significantly improves performance compared to AlwaysPrioritize, especially
when the slowdown bound for the AoI is large. On average, for one of the evaluated bounds, MISE-QoS
improves harmonic speedup by 12% and weighted speedup by 10% (not shown due to lack of
space) over AlwaysPrioritize, while reducing maximum slowdown by 13%. Second, as expected,
the performance and fairness of MISE-QoS approach those of AlwaysPrioritize as the slowdown
bound is decreased (going from left to right for a set of bars). Finally, the benefits of MISE-QoS
increase with increasing memory intensity because always prioritizing a memory-intensive
application will cause significant interference to other applications.

^3 Note that the slowdown bound becomes tighter from left to right.


Based on our results, we conclude that MISE-QoS can effectively ensure that the AoI meets the
specified slowdown bound while achieving high system performance and fairness across the other
applications. In Section 5.1.4, we discuss a case study of a system with two AoIs.
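
The system-level metrics used in this evaluation can be computed from per-application slowdowns. The sketch below uses the standard definitions from the multicore scheduling literature, which this section does not restate, so the formulas should be read as an assumption about how the metrics are defined here. Slowdown is taken to be alone performance divided by shared performance, and the function names are illustrative.

```python
def harmonic_speedup(slowdowns):
    """Number of applications divided by the sum of their slowdowns."""
    return len(slowdowns) / sum(slowdowns)


def weighted_speedup(slowdowns):
    """Sum over applications of shared performance relative to alone performance."""
    return sum(1.0 / s for s in slowdowns)


def maximum_slowdown(slowdowns):
    """Slowdown of the most slowed down application (lower means fairer)."""
    return max(slowdowns)
```

To mirror the evaluation above, the slowdown of the application of interest would simply be left out of the list passed to these functions.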


Using STFM’s Slowdown Estimates to Provide QoS. We evaluate the effectiveness of STFM
in providing slowdown guarantees, by using slowdown estimates from STFM’s model to drive our
QoS-enforcement mechanism. Table 5.3 shows the effectiveness of STFM’s slowdown estimation
model in meeting the prescribed slowdown bounds for the 3000 data points. We draw two major
conclusions. First, the slowdown bound is met and estimated as met for only 63.7% of the
workloads, whereas MISE-QoS meets the slowdown bound and estimates it right for 78.8% of
the workloads (as shown in Table 5.2). The reason is STFM’s high slowdown estimation error.
Second, the percentage of workloads for which the slowdown bound is met/not-met and is
estimated wrong is 18.4%, as compared to 4.3% for MISE-QoS. This is because STFM’s slowdown
estimation model overestimates the slowdown of the AoI and allocates it more bandwidth than
is required to meet the prescribed slowdown bound. Therefore, performance of the other
applications in a workload suffers, as demonstrated in Figure 5.3, which shows the system performance for
different values of the prescribed slowdown bound, for MISE and STFM. For instance, at one of the
evaluated slowdown bounds, STFM-QoS has 5% lower average system performance than MISE-QoS.
Therefore, we conclude that the proposed MISE model enables more effective enforcement of QoS
guarantees for the AoI than the STFM model, while providing better average system performance.
Scenario                             # Workloads   % Workloads
Bound Met and Predicted Right        1911          63.7%
Bound Met and Predicted Wrong        480           16%
Bound Not Met and Predicted Right    537           17.9%
Bound Not Met and Predicted Wrong    72            2.4%


Table 5.3: Effectiveness of STFM-QoS 
[Figure 5.3 legend: QoS-1, QoS-3, QoS-5, QoS-7, QoS-9; bar groups: MISE, STFM]

Figure 5.3: Average system performance using MISE and STFM’s slowdown estimation models
(across 300 workloads)


5.1.4 Case Study: Two AoIs


So far, we have discussed and analyzed the benefits of MISE-QoS for a system with one AoI.
However, there could be scenarios with multiple AoIs, each with its own target slowdown bound.
One can think of two naive approaches to possibly address this problem. In the first approach,
the memory controller can prioritize the requests of all AoIs in the system. This is similar to
the AlwaysPrioritize mechanism described in the previous section. In the second approach, the
memory controller can equally partition the memory bandwidth across all AoIs. We call this
approach EqualBandwidth. However, neither of these mechanisms can guarantee that the AoIs
meet their target bounds. On the other hand, using the mechanism described in Section 5.1.2,
MISE-QoS can be used to achieve the slowdown bounds for multiple AoIs.

To show the effectiveness of MISE-QoS with multiple AoIs, we present a case study with two
AoIs. The two AoIs, astar and mcf, are run in a 4-core system with leslie3d and another copy of mcf.
Figure 5.4 compares the slowdowns of each of the four applications with the different mechanisms.
The same slowdown bound is used for both AoIs.


[Figure 5.4 legend: AlwaysPrioritize, EqualBandwidth, MISE-QoS-1 through MISE-QoS-5; x-axis: astar, mcf, leslie3d, mcf]

Figure 5.4: Meeting a target bound for two applications


Although AlwaysPrioritize prioritizes both AoIs, mcf (the more memory-intensive AoI)
interferes significantly with astar (slowing it down by more than 7×). EqualBandwidth mitigates this
interference problem by partitioning the bandwidth equally between the two applications. MISE-QoS,
in contrast, intelligently partitions the available memory bandwidth between the two applications
to ensure that both of them meet a more stringent target bound. For example, for a slowdown
bound of 10/4, MISE-QoS allocates more than 50% of the bandwidth to astar, thereby reducing
astar’s slowdown below the bound of 2.5, while EqualBandwidth can only achieve a slowdown
of 3.4 for astar, by equally partitioning the bandwidth between the two AoIs. Furthermore, as a
result of its intelligent bandwidth allocation, MISE-QoS significantly reduces the slowdowns of
the other applications in the system compared to AlwaysPrioritize and EqualBandwidth (as seen
in Figure 5.4).

We conclude, based on the evaluations presented above, that MISE-QoS manages memory
bandwidth efficiently to achieve both high system performance and fairness while meeting
performance guarantees for one or more applications of interest.


5.2 MISE-Fair: Minimizing Maximum Slowdown


The second mechanism we build on top of our MISE model is one that seeks to improve overall
system fairness. Specifically, this mechanism attempts to minimize the maximum slowdown across
all applications in the system. Ensuring that no application is unfairly slowed down while
maintaining high system performance is an important goal in multicore systems where co-executing
applications are similarly important.


5.2.1 Mechanism 


At a high level, our mechanism works as follows. The memory controller maintains two pieces
of information: 1) a target slowdown bound (B) for all applications, and 2) a bandwidth allocation
policy that partitions the available memory bandwidth across all applications. The memory
controller enforces the bandwidth allocation policy using the lottery-scheduling technique as
described in Section 4.2.1. The controller attempts to ensure that the slowdown of all applications is
within the bound B. To this end, it modifies the bandwidth allocation policy so that applications
that are slowed down more get more memory bandwidth. Should the memory controller find that
bound B is not possible to meet, it increases the bound. On the other hand, if the bound is easily
met, it decreases the bound. We describe the two components of this mechanism: 1) bandwidth
redistribution policy, and 2) modifying target bound (B).
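
Lottery scheduling over bandwidth shares can be sketched as follows. Section 4.2.1 is not reproduced here, so the sketch assumes the simplest form of the technique: at the start of each epoch, one application is drawn for highest priority with probability proportional to its allocated share. The function name is hypothetical.

```python
import random


def pick_highest_priority(shares_pct, rng):
    """Draw the application that gets highest priority for the next epoch.

    shares_pct maps application name to its bandwidth share in percent;
    rng is a random.Random instance, passed in so runs can be reproduced."""
    apps = list(shares_pct)
    weights = [shares_pct[app] for app in apps]
    return rng.choices(apps, weights=weights, k=1)[0]
```

Over many epochs, the fraction of epochs in which an application holds highest priority approaches its share, which is how a percentage allocation turns into actual memory bandwidth.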


Bandwidth Redistribution Policy. As described in Section 4.2.1, the memory controller
divides execution into multiple intervals. At the end of each interval, the controller estimates the
slowdown of each application and possibly redistributes the available memory bandwidth amongst
the applications, with the goal of minimizing the maximum slowdown. Specifically, the controller
divides the set of applications into two clusters. The first cluster contains those applications whose
estimated slowdown is less than B. The second cluster contains those applications whose
estimated slowdown is more than B. The memory controller steals a small fixed amount of bandwidth
allocation (2%) from each application in the first cluster and distributes it equally among the
applications in the second cluster. This ensures that the applications that do not meet the target bound
B get a larger share of the memory bandwidth.
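
The redistribution step can be sketched as below. This is an illustration under stated assumptions, not the hardware implementation: shares are in percent, an application cannot give up more than it holds, nothing changes when either cluster is empty, and the function name is hypothetical.

```python
def redistribute(shares_pct, slowdowns, bound, steal_pct=2.0):
    """Move steal_pct of bandwidth from each application below the bound
    to the applications above it, split equally among the latter."""
    below = [app for app in shares_pct if slowdowns[app] < bound]
    above = [app for app in shares_pct if slowdowns[app] > bound]
    new_shares = dict(shares_pct)
    if not below or not above:
        return new_shares  # assumption: nothing to steal or nobody to give to
    stolen = 0.0
    for app in below:
        take = min(steal_pct, new_shares[app])  # cannot take more than is held
        new_shares[app] -= take
        stolen += take
    for app in above:
        new_shares[app] += stolen / len(above)
    return new_shares
```

For instance, with four applications at 25% each, slowdowns of 1.2, 1.5, 3.0 and 4.0, and a bound of 2, the first two drop to 23% and the last two rise to 27%, so the total stays at 100%.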

Modifying Target Bound. The target bound B may depend on the workload and the different
phases within each workload. This is because different workloads, or phases within a workload,
have varying demands from the memory system. As a result, a target bound that is easily met for
one workload/phase may not be achievable for another workload/phase. Therefore, our mechanism
dynamically varies the target bound B by predicting whether or not the current value of B is
achievable. For this purpose, the memory controller keeps track of the number of applications that
met the slowdown bound during the past N intervals (3 in our evaluations). If all the applications
met the slowdown bound in all of the N intervals, the memory controller predicts that the bound
is easily achievable. In this case, it sets the new bound to a slightly lower value than the estimated
slowdown of the application that is the most slowed down (a more competitive target). On the
other hand, if more than half the applications did not meet the slowdown bound in all of the N
intervals, the controller predicts that the target bound is not achievable. It then increases the target
slowdown bound to a slightly higher value than the estimated slowdown of the most slowed down
application (a more achievable target).
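
The bound adaptation can be sketched as follows. Two details are assumptions of this sketch rather than facts from the text: the size of a "slightly" lower or higher adjustment (the hypothetical parameter delta), and the reading that "more than half the applications did not meet the bound in all of the N intervals" means a majority missed the bound in each of those intervals.

```python
def update_target_bound(bound, met_history, max_estimated_slowdown, n=3, delta=0.05):
    """met_history: list, oldest first, of per-interval lists of booleans,
    one entry per application (True if that application met the bound)."""
    recent = met_history[-n:]
    if len(recent) < n:
        return bound  # not enough history yet
    if all(all(interval) for interval in recent):
        # Bound is easily achievable: aim for a more competitive target.
        return max_estimated_slowdown - delta
    if all(interval.count(False) > len(interval) / 2 for interval in recent):
        # Bound is not achievable: relax it to a more achievable target.
        return max_estimated_slowdown + delta
    return bound
```

In the remaining cases the bound is left unchanged and only the bandwidth redistribution acts.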


5.2.2 Interaction with the OS 


As we will show in Section 5.2.3, our mechanism provides the best fairness compared to three
state-of-the-art approaches for memory request scheduling. In addition to this, there is
another benefit to using our approach. Our mechanism, based on the MISE model, can accurately
estimate the slowdown of each application. Therefore, the memory controller can potentially
communicate the estimated slowdown information to the operating system (OS). The OS can use this
information to make more informed scheduling and mapping decisions so as to further improve
system performance or fairness. Since prior memory scheduling approaches do not explicitly
attempt to minimize maximum slowdown by accurately estimating the slowdown of individual
applications, such a mechanism to interact with the OS is not possible with them. Evaluating the
benefits of the interaction between our mechanism and the OS is beyond the scope of this thesis.


5.2.3 Evaluation 


Figure 5.5 compares the system fairness (maximum slowdown) of different mechanisms with
increasing number of cores. The figure shows results with four previously proposed memory
scheduling policies (FRFCFS [97, 129], ATLAS [60], TCM, and STFM [86]), and our proposed
mechanism using the MISE model (MISE-Fair). We draw three conclusions from our results.

[Figure 5.5 legend: FRFCFS, ATLAS, TCM, STFM, MISE-Fair]

Figure 5.5: Fairness with different core counts


First, MISE-Fair provides the best fairness compared to all other previous approaches. The
reduction in the maximum slowdown due to MISE-Fair when compared to STFM (the best previous
mechanism) increases with increasing number of cores. With 16 cores, MISE-Fair provides 7.2%
better fairness compared to STFM.

Second, STFM, as a result of prioritizing the most slowed down application, provides better
fairness than all other previous approaches. While the slowdown estimates of STFM are not as
accurate as those of our mechanism, they are good enough to identify the most slowed down
application. However, as the number of concurrently-running applications increases, simply prioritizing
the most slowed down application may not lead to better fairness. MISE-Fair, on the other hand,
works towards reducing maximum slowdown by stealing bandwidth from those applications that
are less slowed down compared to others. As a result, the fairness benefits of MISE-Fair compared
to STFM increase with increasing number of cores.


Third, ATLAS and TCM are more unfair compared to FRFCFS. As shown in prior work,
ATLAS trades off fairness to obtain better performance. TCM, on the other hand, is designed
to provide high system performance and fairness. Further analysis showed us that the cause of
TCM’s unfairness is the strict ranking employed by TCM. TCM ranks all applications based on
its clustering and shuffling techniques and strictly enforces these rankings. We found that
such strict ranking destroys the row-buffer locality of low-ranked applications. This increases the
slowdown of such applications, leading to high maximum slowdown.

[Figure 5.6 legend: FRFCFS, ATLAS, TCM, STFM, MISE-Fair; x-axis: Percentage of Memory Intensive Benchmarks in a Workload]

Figure 5.6: Fairness for 16-core workloads
Effect of Workload Memory Intensity on Fairness. Figure 5.6 shows the maximum slowdown
of the 16-core workloads categorized by workload intensity. While most trends are similar
to those in Figure 5.5, we draw the reader’s attention to a specific point: for workloads with
non-memory-intensive applications (25%, 50% and 75% in the figure), STFM is more unfair than
MISE-Fair. As shown in Figure 4.3, STFM significantly overestimates the slowdown of
non-memory-bound applications. Therefore, for these workloads, we find that STFM prioritizes such
non-memory-bound applications which are not the most slowed down. On the other hand,
MISE-Fair, with its more accurate slowdown estimates, is able to provide better fairness for these
workload categories.


System Performance. Figure 5.7 presents the harmonic speedup of the four previously proposed
mechanisms (FRFCFS, ATLAS, TCM, STFM) and MISE-Fair, as the number of cores is
varied. The results indicate that STFM provides the best harmonic speedup for 4-core and 8-core
systems. STFM achieves this by prioritizing the most slowed down application. However, as the
number of cores increases, the harmonic speedup of MISE-Fair matches that of STFM. This is
because, with increasing number of cores, simply prioritizing the most slowed down application
can be unfair to other applications. In contrast, MISE-Fair takes into account slowdowns of all
applications to manage memory bandwidth in a manner that enables good progress for all
applications. We conclude that MISE-Fair achieves the best fairness compared to prior approaches,
without significantly degrading system performance.


[Figure 5.7 legend: FRFCFS, ATLAS, TCM, STFM, MISE-Fair; x-axis: Number of Cores (4, 8, 16)]

Figure 5.7: Harmonic speedup with different core counts


5.3 Summary


We present two new main memory request scheduling mechanisms that use MISE to achieve two
different goals: 1) MISE-QoS aims to provide soft QoS guarantees to one or more applications of
interest while ensuring high system performance, and 2) MISE-Fair attempts to minimize maximum
slowdown to improve overall system fairness. Our evaluations show that our proposed mechanisms
are more effective than the state-of-the-art memory scheduling approaches
in achieving their respective goals, thereby demonstrating the MISE model’s effectiveness in
estimating and controlling application slowdowns.


Chapter 6


Quantifying Application Slowdowns Due to 
Both Shared Cache Interference and Shared 
Main Memory Interference 


In a multicore system, the shared cache is a key source of contention among applications.
Applications that share the cache contend for its limited capacity. The shared cache capacity allocated to an
application directly determines its memory intensity and hence, the degree of memory interference
in a system.

Figure 6.1 shows the slowdown of two representative applications, bzip2 and soplex, when they
share main memory alone and when they share both shared caches and main memory. As can be
seen, when the two applications share the cache, their slowdown increases significantly compared
to when they share main memory alone. We observe such shared cache interference across several
applications and workloads.

While the MISE model focuses on estimating slowdowns due to contention for main memory
bandwidth, it does not take into account interference at the shared caches. We propose to
take into account the effect of shared cache capacity interference, in addition to main memory
bandwidth interference, in estimating application slowdowns.

[Figure 6.1 x-axis labels: bzip2, soplex]

Figure 6.1: Impact of shared cache interference on application slowdowns

Previous works, FST and PTCA [25], attempt to estimate slowdown due to both shared
cache and main memory interference. However, they are inaccurate, since they quantify the impact
of interference at a per-request granularity, as we described in earlier chapters. The presence of a
shared cache only makes the problem worse as the request stream of an application to main memory
could be completely different depending on whether or not the application shares the cache with
other applications. We strive to estimate an application’s slowdown accurately in the presence of
interference at both the shared cache and the main memory. Towards this end, we propose the
Application Slowdown Model (ASM).


6.1 Overview of the Application Slowdown Model (ASM) 

In contrast to prior works which quantify interference at a per-request granularity, ASM uses
aggregate request behavior to quantify interference, based on the following observation.

6.1.1 Observation: Access rate as a proxy for performance 

The performance of each application is proportional to the rate at which it accesses 
the shared cache. 
Intuitively, an application can make progress when its data accesses are served. The faster
its accesses are served, the faster it makes progress. In the steady state, the rate at which an
application’s accesses are served (service rate) is almost the same as the rate at which it generates
accesses (access rate). Therefore, if an application can generate more accesses to the cache in a
given period of time (higher access rate), then it can make more progress during that time (higher
performance).

MISE observes that the performance of a memory-bound application is proportional to the rate
at which its main memory accesses are served. The observation above is stronger than MISE’s
observation because it relates performance to the shared cache access rate and not
just main memory access rate, thereby accounting for the impact of both shared cache and main
memory interference. Hence, it holds for a broader class of applications that are sensitive to cache
capacity and/or main memory bandwidth, and not just memory-bound applications.

To validate our observation, we conducted an experiment in which we run each application
of interest alongside a hog program on an Intel Core-i5 processor with 6MB shared cache. The
cache and memory access behavior of the hog can be varied to cause different amounts of
interference to the main program. Each application is run multiple times with the hog with different
characteristics. During each run, we measure the performance and shared cache access rate of the
application.

Figure 6.2 plots the results of our experiment for three applications from the SPEC CPU2006
suite [6]. The plot shows cache access rate vs. performance of the application normalized to when
it is run alone. As our results indicate, the performance of each application is indeed proportional
to the cache access rate of the application, validating our observation. We observed the same
behavior for a wide range of applications.

ASM exploits our observation to estimate slowdown as a ratio of cache access rates, instead of 
as a ratio of performance. 

[Figure: x-axis: Normalized Cache Access Rate (normalized to cache access rate when run alone)]

Figure 6.2: Cache access rate vs. performance

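\[
\text{performance} \propto \text{cache-access-rate (CAR)}
\]

\[
\text{Slowdown}
  = \frac{\text{performance}_{\text{alone}}}{\text{performance}_{\text{shared}}}
  = \frac{\text{CAR}_{\text{alone}}}{\text{CAR}_{\text{shared}}}
\]

While CAR_shared and performance_shared are both easy to measure, the challenge is in estimating
performance_alone or CAR_alone.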

CAR_alone vs. performance_alone. In order to estimate an application’s slowdown during a given
interval, prior works such as FST and PTCA estimate its alone execution time (performance_alone)
by tracking the interference experienced by each of the application’s requests served during this
interval and subtracting these interference cycles from the application’s shared execution time
(performance_shared). This approach leads to inaccuracy, since estimating per-request interference
is difficult due to the parallelism in the memory system. CAR_alone, on the other hand, can be
estimated more accurately by exploiting the observation made by several prior works that applications’
phase behavior does not change significantly over time scales on the order of a few million cycles
(e.g., [103, 143]). Hence, CAR_alone can be estimated periodically over short time periods during
which main memory interference is minimized (thereby implicitly accounting for memory level
parallelism) and shared cache interference is quantified, rather than throughout execution. We
describe this in detail in the next section.


6.1.2 Challenge: Accurately Estimating CAR_alone


A naive way of estimating the CAR_alone of an application periodically is to run the application by
itself for short periods of time and measure CAR_alone. While such a scheme would eliminate main
memory interference, it would not eliminate shared cache interference, since the caches cannot
be warmed up at will in a short time duration. Hence, it is not possible to take this approach
to estimate CAR_alone accurately. Therefore, ASM takes a hybrid approach to estimate CAR_alone for
each application by 1) minimizing interference at the main memory, and 2) quantifying interference
at the shared cache.


Minimizing main memory interference. ASM minimizes interference for each application at
the main memory by simply giving each application’s requests the highest priority in the memory
controller periodically for short lengths of time, similar to MISE. This has two benefits. First,
it eliminates most of the impact of main memory interference when ASM is estimating CAR_alone
for the application (the remaining minimal interference is accounted for in Section 6.2.3). Second, it
provides ASM an accurate estimate of the cache miss service time for the application in the absence
of main memory interference. This estimate will be used in the next step, in quantifying shared
cache interference for the application.


Quantifying shared cache interference. To quantify the effect of cache interference, we need
to identify the excess cycles that are spent in serving shared cache misses that are contention
misses, i.e., those that would have otherwise hit in the cache had the application run alone on the
system. We use an auxiliary tag store for each application to first identify contention misses. Once
we determine the aggregate number of contention misses, we use the average cache miss service
time (computed in the previous step) and average cache hit service time to estimate the excess
number of cycles spent serving the contention misses, essentially quantifying the effect of shared
cache interference.

6.1.3 ASM vs. Prior Work


ASM is better than prior work for three reasons. First, as we describe in Section 2.11 and in
the beginning of this chapter, prior works aim to estimate the effect of main memory interference
on each contention miss individually, which is difficult and inaccurate. In contrast, our approach
eliminates most of the main memory interference for an application by giving the application’s
requests the highest priority, which also allows ASM to gather a good estimate of the average cache
miss service time. Second, to quantify the effect of shared cache interference, ASM only needs to
identify the number of contention misses, unlike prior approaches that need to determine whether
or not every individual request is a contention miss. This makes ASM more amenable to
hardware-overhead-reduction techniques like set sampling (more details in Sections 6.2.4 and 6.2.5). In
other words, the error introduced by set sampling in estimating the number of contention misses
is far lower than the error it introduces in estimating the actual number of cycles by which each
contention miss is delayed due to interference. Third, as we describe in Section 7.1, ASM enables
estimation of slowdowns for different cache allocations in a straightforward manner, which is
non-trivial using prior models.


In summary, ASM estimates application slowdowns as a ratio of cache access rates. ASM
overcomes the challenge of estimating CAR_alone by minimizing interference at the main memory and
quantifying interference at the shared cache. In the next section, we describe the implementation
of ASM.


6.2 Implementing ASM 

ASM divides execution into multiple quanta, each of length Q cycles (a few million cycles). At
the end of each quantum, ASM 1) measures CAR_shared and 2) estimates CAR_alone for each
application, and reports the slowdown of each application as the ratio of the application’s
CAR_alone to its CAR_shared.

6.2.1 Measuring CAR_shared


Measuring CAR_shared for each application is fairly straightforward. ASM keeps a per-application
counter that tracks the number of shared cache accesses for the application. The counter is cleared
at the beginning of each quantum and is incremented whenever there is a new shared cache access
for the application. At the end of each quantum, the CAR_shared for each application can be computed
as

\[
\text{cache-access-rate}_{\text{shared}} = \frac{\text{\# Shared Cache Accesses}}{Q}
\]
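The counter behavior described above can be sketched in software as follows. This is only an illustrative model, not the hardware design: the class name `SharedAccessCounter`, its method names, and the `slowdown` helper are hypothetical names introduced here.

```python
class SharedAccessCounter:
    """Illustrative model of one application's shared-cache access counter."""

    def __init__(self, quantum_cycles):
        self.quantum_cycles = quantum_cycles  # Q, the quantum length in cycles
        self.accesses = 0                     # cleared at the start of each quantum

    def on_shared_cache_access(self):
        # Incremented on every new shared cache access from this application.
        self.accesses += 1

    def end_quantum(self):
        # CAR_shared = (# shared cache accesses) / Q, then clear for the next quantum.
        car_shared = self.accesses / self.quantum_cycles
        self.accesses = 0
        return car_shared


def slowdown(car_alone, car_shared):
    # Slowdown is reported as the ratio CAR_alone / CAR_shared.
    return car_alone / car_shared
```

A simulator would call `on_shared_cache_access()` from its cache model and `end_quantum()` once every Q cycles.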


6.2.2 Estimating CAR_alone


As we described in Section 6.1.2, during each quantum, ASM periodically estimates the CAR_alone of
each application by minimizing interference at the main memory and quantifying the interference
at the shared cache. Towards this end, ASM divides each quantum into epochs of length E cycles
(thousands of cycles), similar to MISE. Each epoch is probabilistically assigned to one of the
co-running applications. During each epoch, ASM collects information for the corresponding
application that will later be used to estimate CAR_alone for the application. Each application has
equal probability of being assigned an epoch. Assigning epochs to applications in a round-robin
fashion could also achieve similar effects. However, we build mechanisms on top of ASM that
allocate bandwidth to applications in a slowdown-aware manner (Section [T!^), similar to MISE-QoS
and MISE-Fair. Therefore, in order to facilitate building such mechanisms on top of ASM,
we employ a policy that probabilistically assigns an application to each epoch.
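A minimal sketch of probabilistic epoch assignment is shown below, assuming a software model of the scheduler. The function name `assign_epochs` and the optional `weights` parameter are hypothetical; the text only states that each application has equal probability, and `weights` merely illustrates how a slowdown-aware mechanism built on top might bias the assignment.

```python
import random


def assign_epochs(app_ids, quantum_cycles, epoch_cycles, rng=None, weights=None):
    """Pick one application for each epoch of a quantum.

    With weights=None every application is equally likely, as in the text.
    Non-uniform weights are an assumption added here for illustration only.
    """
    rng = rng or random.Random()
    num_epochs = quantum_cycles // epoch_cycles  # Q / E epochs per quantum
    return rng.choices(app_ids, weights=weights, k=num_epochs)


# Example: four applications, Q = 5,000,000 cycles, E = 10,000 cycles.
schedule = assign_epochs([0, 1, 2, 3], 5_000_000, 10_000, rng=random.Random(0))
```

Because the policy is expressed as a probability distribution, changing an application's share of highest-priority epochs only requires changing its weight, which is harder to express with a fixed round-robin order.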


At the beginning of each epoch, ASM communicates the ID of the application assigned to the 
epoch to the memory controller. During that epoch, the memory controller gives the corresponding 
application’s requests the highest priority in accessing main memory. 


To track contention misses, ASM maintains an auxiliary tag store for each application that
tracks the state of the cache had the application been running alone. The auxiliary tag store of an
application holds only the tag entries (not the data) of cache blocks. When a request from another
application evicts an application’s block from the shared cache, the tag entry corresponding to the
evicted block still remains in the application’s auxiliary tag store. Hence, the auxiliary tag store
effectively tracks the state of the cache had the application been running alone on the system.

Name               Definition
epoch-count        Number of epochs assigned to the application
epoch-hits         Total number of shared cache hits for the application during its assigned epochs
epoch-misses       Total number of shared cache misses for the application during its assigned epochs
epoch-hit-time     Number of cycles during which the application has at least one outstanding hit
                   during its assigned epochs
epoch-miss-time    Number of cycles during which the application has at least one outstanding miss
                   during its assigned epochs
epoch-ATS-hits     Number of auxiliary tag store hits for the application during its assigned epochs
epoch-ATS-misses   Number of auxiliary tag store misses for the application during its assigned epochs

Table 6.1: Quantities measured by ASM for each application to estimate CAR_alone
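The following is a rough software sketch of a per-application auxiliary tag store, assuming LRU replacement and 64B blocks (matching the simulated cache configuration). The class `AuxiliaryTagStore` and the helper `is_contention_miss` are hypothetical names; actual hardware would use tag arrays rather than Python dictionaries.

```python
from collections import OrderedDict


class AuxiliaryTagStore:
    """Tags only (no data) for one application, with per-set LRU replacement."""

    def __init__(self, num_sets, ways, block_bytes=64):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, address):
        """Return True on an ATS hit; update LRU state on hits and misses."""
        block = address // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lru = self.sets[index]
        hit = tag in lru
        if hit:
            lru.move_to_end(tag)         # mark as most recently used
        else:
            if len(lru) >= self.ways:
                lru.popitem(last=False)  # evict the least recently used tag
            lru[tag] = True
        return hit


def is_contention_miss(ats_hit, shared_cache_hit):
    # A contention miss misses in the shared cache but would have hit alone.
    return ats_hit and not shared_cache_hit
```

Only this application's own accesses update its tag store, so evictions caused by other applications in the real shared cache never remove a tag here.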


In this section, we will assume a full auxiliary tag store for ease of description. However, as we
will describe in Section 6.2.4, our final implementation uses set sampling to significantly reduce
the overhead of the auxiliary tag store with negligible loss in accuracy.


Table 6.1 lists the quantities that are measured by ASM for each application during the epochs
that are assigned to the application. At the end of each quantum, ASM uses these quantities to
estimate the CAR_alone of the application. These metrics can be measured using a counter for each
quantity while the application is running with other applications.


The CAR_alone of an application is given by

\[
\text{CAR}_{\text{alone}}
  = \frac{\text{\# Requests served during application's epochs}}
         {\text{Time to serve above requests when run alone}}
  = \frac{\text{epoch-hits} + \text{epoch-misses}}
         {(\text{epoch-count} \times E) - \text{epoch-excess-cycles}}
\]

where epoch-count × E represents the actual time the system spent serving those requests from the
application, and epoch-excess-cycles is the number of excess cycles spent serving the application’s
contention misses, i.e., those that would have been hits had the application run alone.

At a high level, for each contention miss, the system spends the time of serving a miss as
opposed to a hit had the application been running alone. Therefore,

\[
\text{epoch-excess-cycles} = (\text{\# Contention Misses}) \times (\text{avg-miss-time} - \text{avg-hit-time})
\]

where avg-miss-time is the average miss service time and avg-hit-time is the average hit service
time for the application for requests served during the application’s epochs. Each of these terms
can be computed using the quantities measured by ASM, as follows.

\[
\text{\# Contention Misses} = \text{epoch-ATS-hits} - \text{epoch-hits}
\]

\[
\text{Average Miss Service Time} = \frac{\text{epoch-miss-time}}{\text{epoch-misses}}
\]

\[
\text{Average Hit Service Time} = \frac{\text{epoch-hit-time}}{\text{epoch-hits}}
\]

6.2.3 Accounting for Memory Queueing 

During each epoch, when there are no requests from the highest priority application, the memory
controller may schedule requests from other applications. If a high priority request arrives after
another application’s request is scheduled, it may be delayed. To address this problem, we
apply a mechanism similar to the interference cycle estimation mechanism in MISE (Section 4.2.3),
wherein ASM measures the number of queueing cycles for each application using a counter. A
cycle is deemed a queueing cycle if a request from the highest priority application is outstanding and
the previous command issued by the memory controller was from another application. At the end
of each quantum, the counter represents the queueing delay for all epoch-misses. However, since
ASM has already accounted for the queueing delay of the contention misses during its previous
estimate by removing the epoch-excess-cycles taken to serve contention misses, it only needs to
account for the queueing delay for the remaining true misses, i.e., epoch-ATS-misses. In order to
do this, ASM computes the average number of queueing cycles for each miss from the application,

\[
\text{avg-queueing-delay} = \frac{\text{\# queueing cycles}}{\text{epoch-misses}}
\]

and computes its final CAR_alone estimate as

\[
\text{CAR}_{\text{alone}}
  = \frac{\text{epoch-hits} + \text{epoch-misses}}
         {(\text{epoch-count} \times E) - \text{epoch-excess-cycles}
          - (\text{epoch-ATS-misses} \times \text{avg-queueing-delay})}
\]
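Putting the equations of Sections 6.2.2 and 6.2.3 together, the end-of-quantum computation can be sketched as below. The `EpochCounters` container and the function name `estimate_car_alone` are hypothetical, and the guards against division by zero are an assumption added for the sketch (the text does not discuss empty counters).

```python
from dataclasses import dataclass


@dataclass
class EpochCounters:
    """The seven quantities of Table 6.1 plus the queueing-cycle counter."""
    epoch_count: int
    epoch_hits: int
    epoch_misses: int
    epoch_hit_time: int
    epoch_miss_time: int
    epoch_ats_hits: int
    epoch_ats_misses: int
    queueing_cycles: int


def estimate_car_alone(c, epoch_cycles):
    """Estimate CAR_alone from one quantum's counters; epoch_cycles is E."""
    # Average service times (zero guards are an assumption of this sketch).
    avg_hit_time = c.epoch_hit_time / c.epoch_hits if c.epoch_hits else 0.0
    avg_miss_time = c.epoch_miss_time / c.epoch_misses if c.epoch_misses else 0.0
    avg_queueing_delay = c.queueing_cycles / c.epoch_misses if c.epoch_misses else 0.0

    # Contention misses would have hit in the cache had the application run alone.
    contention_misses = c.epoch_ats_hits - c.epoch_hits
    excess_cycles = contention_misses * (avg_miss_time - avg_hit_time)

    # Remove excess cycles and the queueing delay of the remaining true misses.
    alone_cycles = (c.epoch_count * epoch_cycles
                    - excess_cycles
                    - c.epoch_ats_misses * avg_queueing_delay)
    return (c.epoch_hits + c.epoch_misses) / alone_cycles


# Made-up counter values, for illustration only.
example = EpochCounters(epoch_count=100, epoch_hits=8000, epoch_misses=2000,
                        epoch_hit_time=160_000, epoch_miss_time=400_000,
                        epoch_ats_hits=9000, epoch_ats_misses=1000,
                        queueing_cycles=20_000)
car_alone = estimate_car_alone(example, epoch_cycles=10_000)
```

In this made-up example there are 1,000 contention misses, each costing 180 excess cycles, and 1,000 true misses with an average queueing delay of 10 cycles, so the alone-run time is 1,000,000 − 180,000 − 10,000 = 810,000 cycles for 10,000 requests.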


6.2.4 Sampling the Auxiliary Tag Store 

As we mentioned before, in our final implementation, we use set sampling to reduce the overhead
of the auxiliary tag store (ATS). Using this approach, the ATS is maintained only for a few sampled
sets. The only two quantities that are affected by sampling are epoch-ATS-hits and
epoch-ATS-misses. With sampling enabled, we first measure the fraction of hits/misses in the sampled ATS.
We then compute epoch-ATS-hits/epoch-ATS-misses as the product of the hit/miss fraction and the
total number of cache accesses.

\[
\text{epoch-ATS-hits} = \text{ats-hit-fraction} \times \text{epoch-accesses}
\]

\[
\text{epoch-ATS-misses} = \text{ats-miss-fraction} \times \text{epoch-accesses}
\]

where epoch-accesses = epoch-hits + epoch-misses.
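The scaling step can be sketched as follows. This assumes that the hit and miss fractions are taken over the accesses that map to the sampled sets; the function name `scale_sampled_ats` and its parameter names are hypothetical.

```python
def scale_sampled_ats(sampled_ats_hits, sampled_ats_misses, epoch_hits, epoch_misses):
    """Scale hit/miss counts from the sampled ATS sets up to all cache accesses.

    Returns (epoch_ATS_hits, epoch_ATS_misses) as estimated values.
    """
    sampled_accesses = sampled_ats_hits + sampled_ats_misses
    if sampled_accesses == 0:
        # No access touched a sampled set (guard added for this sketch).
        return 0.0, 0.0
    epoch_accesses = epoch_hits + epoch_misses
    ats_hit_fraction = sampled_ats_hits / sampled_accesses
    ats_miss_fraction = sampled_ats_misses / sampled_accesses
    return ats_hit_fraction * epoch_accesses, ats_miss_fraction * epoch_accesses
```

Only an aggregate fraction is scaled here, not a per-request delay, which is consistent with the argument that ASM tolerates sampling better than per-request schemes.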


6.2.5 Hardware Cost 


ASM tracks the seven quantities in Table 6.1 and # queueing cycles using registers. We find that
using a four-byte register for each of these counters is more than sufficient for the values they
keep track of. Hence, the counter overhead is 32 bytes for each application. In addition to these
counters, an auxiliary tag store (ATS) is maintained for each application. The ATS size depends
on the number of sets that are sampled. For 64 sampled sets and 16 ways per set, assuming four
bytes for each entry, the overhead is 4KB per application, which is 0.2% of the size of a 2MB cache
(used in our main evaluations).
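As a quick check of the arithmetic, the overheads work out as follows. The last line is a cross-check that assumes the 64B line size of the simulated cache and four bytes per entry; it matches the 128KB unsampled ATS size quoted in the sensitivity study.

```latex
\begin{align*}
\text{counters per application} &= (7 + 1) \times 4\,\text{B} = 32\,\text{B} \\
\text{sampled ATS per application} &= 64\ \text{sets} \times 16\ \text{ways} \times 4\,\text{B}
   = 4096\,\text{B} = 4\,\text{KB} \\
\frac{4\,\text{KB}}{2\,\text{MB}} &= \frac{4096}{2097152} \approx 0.195\% \approx 0.2\% \\
\text{unsampled ATS per application} &= \frac{2\,\text{MB}}{64\,\text{B}} \times 4\,\text{B}
   = 32768 \times 4\,\text{B} = 128\,\text{KB}
\end{align*}
```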

6.3 Methodology


System Configuration. We model the main memory system using a cycle-level in-house
DDR3-SDRAM simulator. We validated the simulator against DRAMSim2 [|9^ and Micron’s behavioral
Verilog model [80]. We integrate our DRAM simulator with an in-house simulator that models
out-of-order cores with a Pin [73] frontend. Each system consists of a per-core private L1 cache
and a shared L2 cache. Table 6.2 lists the main system parameters. Our main evaluations use a
4-core system with a 2MB shared cache and 1-channel main memory.


Component           Configuration
Processor           4-16 cores, 5.3GHz, 3-wide issue, 128-entry instruction window
L1 cache            64KB, private, 4-way associative, LRU, line size = 64B, latency = 1 cycle
Last-level cache    1MB-4MB, shared, 16-way associative, LRU, line size = 64B, latency = 20 cycles
Memory controller   128-entry request buffer per controller, FR-FCFS [97] scheduling policy
Main memory         DDR3-1333 (10-10-10) llTOll, 1-4 channels, 1 rank/channel, 8 banks/rank, 8KB rows

Table 6.2: Configuration of the simulated system


Workloads. For our multiprogrammed workloads, we use applications from the SPEC CPU2006 [0
and NAS Parallel Benchmark dH suites (run single-threaded). We construct workloads with varying
memory intensity; applications for each workload are chosen randomly. We run each workload
for 100 million cycles. In all, we present results for 100 4-core, 100 8-core and 100 16-core
workloads.


Metrics. We use average error to compare the accuracy of ASM and previously proposed models
(similar to MISE). We compute the slowdown estimation error for each application, at the end of
every quantum (Q), as

\[
\text{Error} = \left| \frac{\text{Estimated Slowdown} - \text{Actual Slowdown}}{\text{Actual Slowdown}} \right| \times 100\%
\]

\[
\text{Actual Slowdown} = \frac{\text{IPC}_{\text{alone}}}{\text{IPC}_{\text{shared}}}
\]

We compute IPC_alone for the same amount of work as the shared run for each quantum. For each
application, we compute the average slowdown estimation error across all quanta in a workload
run and then compute the average across all occurrences of the application in all of our workloads.
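The error metric and the per-application averaging over quanta can be sketched as follows. The function names are hypothetical, and the input is assumed to be a list of (estimated, actual) slowdown pairs, one per quantum.

```python
def slowdown_error_percent(estimated, actual):
    """Absolute slowdown estimation error for one quantum, in percent."""
    return abs(estimated - actual) / actual * 100.0


def average_error(per_quantum):
    """Average error over (estimated, actual) slowdown pairs, one per quantum."""
    errors = [slowdown_error_percent(est, act) for est, act in per_quantum]
    return sum(errors) / len(errors)
```

Taking the absolute value before averaging prevents overestimates and underestimates in different quanta from cancelling each other out.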


Parameters. We compare ASM with two previous slowdown estimation models: Fairness via
Source Throttling (FST) [27] and Per-Thread Cycle Accounting (PTCA) [23]. For ASM, we set the
quantum length (Q) to 5,000,000 cycles and the epoch length (E) to 10,000 cycles. For ASM and
PTCA, we present results both with sampled and unsampled auxiliary tag stores (ATS). For FST,
we present results with various pollution filter sizes that match the size of the ATS. Section 6.4.5
evaluates the sensitivity of ASM to sampling, quantum and epoch lengths.


6.4 Evaluation of the Model 


6.4.1 Slowdown Estimation Accuracy 


Figure 6.3 compares the average slowdown estimation error from FST, PTCA, and ASM, with no
sampling in the auxiliary tag store for PTCA and ASM, and an equal-overhead pollution filter for FST.
The benchmarks on the left are from the SPEC CPU2006 suite and those on the right are from the NAS
benchmark suite. Benchmarks within each suite are sorted based on memory intensity. Figure 6.4
presents the corresponding results with a sampled auxiliary tag store (64 cache sets) for PTCA and
ASM, and an equal-size pollution filter for FST.

Figure 6.3: Slowdown estimation accuracy with no sampling

Figure 6.4: Slowdown estimation accuracy with sampling

We draw three major conclusions. First, even without sampling, ASM has significantly lower
slowdown estimation error (9%) compared to FST (18.5%) and PTCA (14.7%) (the error in estimating
performance_alone is similar). This is because, as described in Section 6.2, prior works attempt to
quantify the effect of interference on a per-request basis, which is inherently difficult and inaccurate
given the abundant parallelism in the memory subsystem. ASM, in contrast, uses aggregate request
behavior to quantify the effect of interference, and hence is more accurate. Our error estimates for
PTCA are higher than what is reported in their paper [23]. This is because they use
first-come-first-served scheduling at the memory controller. It is easier to estimate the effect of interference
using this policy than with the FR-FCFS [97] policy, which can dynamically reorder requests from
different applications.

Second, sampling the auxiliary tag store and reducing the size of the pollution filter significantly
increase the slowdown estimation error of PTCA and FST respectively, while sampling has negligible
impact on the slowdown estimation error of ASM. PTCA’s error increases from 14.7% to 40.4% and
FST’s error increases from 18.5% to 29.4%, whereas ASM’s error increases from 9% to only 9.9%.
PTCA’s error increases with sampling because it estimates the number of cycles by which each
contention miss (from the sampled sets) is delayed, and scales up this number to the entire cache.
However, since different requests may experience different levels of interference, this scaling
introduces more error in PTCA’s estimates. FST’s slowdown estimation error also increases with
sampling, but the increase is not as significant as PTCA’s, because it uses
a pollution filter that is implemented using a Bloom filter iTTSll, which is robust to size reductions.
ASM’s slowdown estimation error does not increase much with sampling, since the slowdown of
an application is estimated only when the application has the highest priority and is experiencing
minimal interference at the main memory. Quantifying the impact of shared cache interference using
aggregate contention miss counts and average high priority miss service time estimates is easier
and more accurate when an application’s memory interference is minimized, rather than tracking
per-request interference when an application is experiencing interference at both the shared cache
and main memory. Section 6.4.5 presents more detailed evaluations demonstrating the impact of
sampling.


Third, FST and PTCA’s slowdown estimates are particularly inaccurate for applications with
high memory intensity (e.g., soplex, libquantum, mcf) and high cache sensitivity (e.g., NPB ft,
dealII, bzip2). This is because applications with high memory intensity generate a large number
of requests to memory, and accurately modeling the overlap in service of such a large number of
requests is difficult, resulting in inaccurate slowdown estimates. Similarly, an application with high
cache sensitivity is severely affected by shared cache interference. Hence, the request stream to
main memory of the application will be drastically different when it is run alone vs. when it shares
the cache with other applications. This makes it hard to estimate per-request interference. ASM
simplifies the problem by minimizing main memory interference and tracking aggregate rather
than per-request behavior when memory interference is minimized, resulting in significantly lower
error than prior work for applications with high memory intensity and/or cache sensitivity.


In summary, with reasonable hardware overhead, ASM estimates slowdowns more accurately 
than prior work and is more robust to applications with varying access behaviors. 

6.4.2 Distribution of Slowdown Estimation Error


Figure 6.5 shows the distribution of slowdown estimation error for FST, PTCA (unsampled) and
ASM (sampled), across all the 400 instances of different applications in our 100 4-core workloads.
The x-axis shows error ranges and the y-axis shows what fraction of points lie in each range. Two
observations are in order. First, 95.25% of ASM’s slowdown estimates have an error less than
20%, whereas only 76.25% and 79.25% of FST and PTCA’s estimates respectively lie within the
20% mark. Second, ASM’s maximum error is only 36%, while FST and PTCA have maximum
errors of 133% and 87% respectively. We observe that ASM’s maximum error is for astar, which
has moderate memory intensity. When astar is run with other higher intensity applications, it
is difficult to account for memory queueing accurately for astar, despite employing the technique
described in Section 6.2.3, leading to inaccuracy. The highest errors for FST/PTCA are for lbm/mcf
respectively, which are among the most memory-intensive of the applications we evaluate. As
described in Section 6.4.1, the request overlap when run alone vs. together is particularly dissimilar
for such applications, making it difficult to estimate alone behavior accurately by tracking
per-request interference. We conclude that ASM’s slowdown estimates have much lower variance than
FST and PTCA’s estimates and are more robust.



[Figure: distribution of slowdown estimation error; x-axis: Error (in %), with range boundaries at 10, 20, 30, 50, 70 and 100]

Figure 6.5: Error distribution

6.4.3 Impact of Prefetching


Figure 6.6 shows the average slowdown estimation error for FST, PTCA and ASM, across 100
4-core workloads (unsampled), with a stride prefetcher ifT^ of degree four. ASM achieves a
significantly low error of 7.5%, compared to 20% and 15% for FST and PTCA respectively. ASM’s
error reduces compared to not employing a prefetcher, since memory interference induced stalls
reduce with prefetching, thereby reducing the amount of interference whose impact on slowdowns
needs to be estimated. This reduction in interference is true for FST and PTCA as well. However,
their error increases slightly compared to not employing a prefetcher, since they estimate
interference at a per-request granularity. The introduction of prefetch requests causes more disruption and
hard-to-estimate overlap behavior among requests going to main memory, making it even more
difficult to estimate interference at a per-request granularity. In contrast, ASM uses aggregate
request behavior to estimate slowdowns, which is more robust, resulting in more accurate slowdown
estimates with prefetching.


[Figure: legend: FST, PTCA, ASM]

Figure 6.6: Prefetching impact


6.4.4 Sensitivity to System Parameters 

Core Count. Figure 6.7 presents the sensitivity of slowdown estimates from FST, PTCA and ASM
to the number of cores. Since PTCA and FST’s slowdown estimation error degrades significantly
with sampling, for all studies from now on, we present results for prior works with no sampling.
However, for ASM, we still present results with a sampled auxiliary tag store. We evaluate 100
workloads for each core count. We draw two conclusions. First, ASM’s slowdown estimates are
significantly more accurate than slowdown estimates from FST and PTCA across all core counts.
Second, ASM’s accuracy gains over FST and PTCA increase with increasing core count. As core
count increases, interference at the shared cache and main memory increases and consequently,
request behavior at the shared cache and main memory is even more different from alone run
behavior. ASM tackles this problem by tracking aggregate request behavior, thereby scaling more
effectively to larger systems with more cores.

Cache Capacity. Figure 6.8 shows the sensitivity of slowdown estimates from ASM and previous
schemes to cache capacity across all our 4-core workloads. ASM’s slowdown estimates are
significantly more accurate than slowdown estimates from FST and PTCA, across all cache capacities.


[Figure: legend: FST, PTCA, ASM; x-axis: Number of Cores]

Figure 6.7: Sensitivity to core count

[Figure: legend: FST, PTCA, ASM; x-axis: Cache Capacity]

Figure 6.8: Sensitivity to cache capacity


6.4.5 Sensitivity to Algorithm Parameters 


ATS/Pollution Filter Size. As we already explained in Section 6.4.1, ASM is robust to reduction
in the size of the auxiliary tag store (ATS). In contrast, FST and PTCA are significantly affected
when the size of the pollution filter and ATS respectively are reduced. Figure 6.9 illustrates this
further by plotting the sensitivity of slowdown estimates from ASM, FST and PTCA to the size of the
auxiliary tag store (ATS)/pollution filter. The leftmost bar for PTCA and ASM corresponds to no
sampling (128KB ATS) and as we move to the right, we decrease the number of sampled sets. The
rightmost bar corresponds to 64 sampled sets (4KB ATS). For FST, we set the size of the pollution
filter to be the same as the corresponding ATS. As expected, varying the size of the ATS has no
visible impact on the estimation error of ASM, whereas the estimation error of FST and PTCA
increases with more aggressive sampling.

Figure 6.9: Sensitivity to ATS size

Epoch and Quantum Lengths. Table 6.3 shows the average slowdown estimation error, across all
our workloads, for different values of the quantum (Q) and epoch (E) lengths. As the table shows,
the estimation error increases with decreasing quantum length and increasing epoch length. This
is because the number of epochs (Q/E) decreases as quantum length (Q) decreases and/or epoch
length (E) increases. With fewer epochs, certain applications may not be assigned enough epochs
to enable ASM to reliably estimate their CAR_alone. For our main evaluations, we use a quantum
length of 5,000,000 cycles and epoch length of 10,000 cycles.
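To make the Q/E argument concrete, the short sketch below computes how many epochs each application can expect per quantum under equal-probability assignment. The function name `expected_epochs_per_app` is hypothetical, and the 4-application setting is only an example.

```python
def expected_epochs_per_app(quantum_cycles, epoch_cycles, num_apps):
    """Expected epochs per application per quantum with equal-probability assignment."""
    num_epochs = quantum_cycles // epoch_cycles  # Q / E
    return num_epochs / num_apps


# Main-evaluation setting versus the shortest-quantum, longest-epoch setting.
main_setting = expected_epochs_per_app(5_000_000, 10_000, num_apps=4)
worst_setting = expected_epochs_per_app(1_000_000, 100_000, num_apps=4)
```

With 4 applications, the main setting gives each application 125 epochs per quantum on average, while Q = 1,000,000 and E = 100,000 leave only 2.5, so some applications may receive very few epochs in a given quantum.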

                    Epoch Length (cycles)
Quantum Length      10000      50000      100000
(cycles)
1000000             12%        14%        16.6%
5000000             9.9%       10.6%      11.5%
10000000            9.2%       9.9%       10.5%

Table 6.3: Sensitivity to epoch and quantum lengths


6.5 Summary 

We present the Application Slowdown Model (ASM) to estimate the slowdowns of applications
running concurrently on a multicore system due to both shared cache and main memory
interference. We observe that the performance of each application is proportional to the rate at which the
application accesses the shared cache. ASM exploits this observation to quantify interference using
the aggregate request behavior of each application, by minimizing interference at the main
memory and quantifying interference at the shared cache. As a result, ASM estimates slowdown more
accurately than prior works, which rely on quantifying interference at a much finer per-request
granularity.

Chapter 7


Applications of ASM 


ASM’s ability to estimate application slowdowns due to both shared cache and main memory
interference can be leveraged to build various hardware, software and hardware-software-cooperative
slowdown-aware resource management mechanisms to improve performance and fairness, and to
provide soft slowdown guarantees. Furthermore, accurate slowdown estimates can be used to drive
fair pricing schemes based on slowdowns, rather than just resource allocation or virtual machine
migration, in a cloud computing setting [|3l[Il. We explore five such use cases of ASM in this
chapter.


7.1 ASM Cache Partitioning (ASM-Cache) 

ASM-Cache partitions the shared cache capacity among applications with the goal of minimizing 
slowdown. The basic idea is to allocate more cache ways to applications whose slowdowns reduce 
the most when given additional cache capacity. 

7.1.1 Mechanism


ASM-Cache consists of two main components. First, to partition the cache in a slowdown-aware
manner, we estimate the slowdown of each application when the application is given different
numbers of cache ways. Next, we determine the cache way allocation for each application based on
the slowdown estimates using a mechanism similar to Utility-based Cache Partitioning [95].


Slowdown Estimation. Using the observation described in Section 6.1.1, we estimate the slowdown of
an application when it is allocated n ways as

\[
\text{slowdown}_{n} = \frac{\text{CAR}_{\text{alone}}}{\text{CAR}_{n}}
\]

where CAR_n is the cache access rate of the application when n ways are allocated for the
application. We estimate CAR_alone using the mechanism described in Section 6.2. While CAR_n can be
estimated by measuring it while giving all possible way allocations to each application, such an
approach is expensive and detrimental to performance as the search space is huge. Therefore, we
propose to estimate CAR_n using a mechanism similar to estimating CAR_alone.


Let quantum-hits and quantum-misses be the number of shared eaehe hits and misses for the 


applieation during a quantum. At the end of the quantum, 

quantum-hits -l- quantum-misses 

CARji = - 

# Cyeles to serve above aeeesses with n ways 

The ehallenge is in estimating the denominator, i.e., the number of eyeles taken to serve an ap- 


plieation’s shared eaehe aeeesses during the quantum, if the applieation had been given n ways. 
To estimate this, we first determine the number of shared cache accesses that would have hit in 


the eaehe had the application been given n ways (quantum-hitSn)- This ean be directly obtained 
from the auxiliary tag store. (We use a sampling auxiliary tag store and seale up the sampled 


quantum-hitSn value using the meehanism deseribed in Seetion 6.2.4 ) 


There are three cases: 1) quantum-hits_n = quantum-hits, 2) quantum-hits_n > quantum-hits, and
3) quantum-hits_n < quantum-hits. In the first case, when the number of hits with n ways is the
same as the number of hits during the quantum, we expect the system to take the same number of
cycles to serve the requests even with n ways, i.e., Q cycles. In the second case, when there are
more hits with n ways, we expect the system to serve the requests in fewer than Q cycles. Finally,
in the third case, when there are fewer hits with n ways, we expect the system to take more than Q
cycles to serve the requests. Let Δhits denote quantum-hits_n − quantum-hits. If quantum-hit-time
and quantum-miss-time are the average cache hit service time and average cache miss service time
for the accesses of the application during the quantum, we estimate the number of cycles to serve
the requests with n ways as

\[ \text{cycles}_n = Q - \Delta\text{hits} \times (\text{quantum-miss-time} - \text{quantum-hit-time}) \]

wherein we remove/add the estimated excess cycles spent in serving the additional hits/misses,
respectively, for the application with n ways. Hence, CAR_n is

\[ CAR_n = \frac{\text{quantum-hits} + \text{quantum-misses}}{Q - \Delta\text{hits} \times (\text{quantum-miss-time} - \text{quantum-hit-time})} \]
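
The estimate above can be illustrated with a minimal Python sketch. It is not the hardware mechanism itself: the function and parameter names are hypothetical, and it assumes the per-quantum counters (hits, misses, average service times) and the auxiliary-tag-store hit count for n ways have already been collected.

```python
def estimate_car_n(quantum_hits, quantum_misses, quantum_hits_n,
                   quantum_hit_time, quantum_miss_time, quantum_cycles):
    """Estimate the cache access rate (accesses per cycle) with n ways.

    quantum_hits_n is the (scaled-up) number of hits the auxiliary tag
    store reports for an allocation of n ways; quantum_cycles is Q.
    """
    # Positive when n ways would yield more hits than observed.
    delta_hits = quantum_hits_n - quantum_hits
    # Each access that turns from a miss into a hit saves
    # (miss time - hit time) cycles; the reverse adds them.
    cycles_n = quantum_cycles - delta_hits * (quantum_miss_time - quantum_hit_time)
    if cycles_n <= 0:
        raise ValueError("estimated cycle count must be positive")
    return (quantum_hits + quantum_misses) / cycles_n


def estimate_slowdown_n(car_alone, car_n):
    """Slowdown with n ways as the ratio of alone to shared access rates."""
    return car_alone / car_n
```

For example, with Q = 1,000,000 cycles, 6,000 hits, 4,000 misses, 7,000 hits predicted for n ways, and hit/miss service times of 20/200 cycles (all made-up numbers), the estimated cycle count is 820,000 and CAR_n is 10,000/820,000 accesses per cycle.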

It is important to note that extending ASM to estimate application slowdowns for different
possible cache allocations is straightforward since we use aggregate cache access rates to
estimate slowdowns. Cache access rates for different cache capacity allocations can be estimated in a
straightforward manner. In contrast, extending previous slowdown estimation techniques such as
FST and PTCA to estimate slowdowns for different possible cache allocations would require
estimating whether every individual request would have been a hit/miss for every possible cache
allocation, which is non-trivial.

Cache Partitioning. Once we have each application’s slowdown estimates for different way
allocations, we use the look-ahead algorithm used in Utility-based Cache Partitioning (UCP) [95] to
partition the cache ways such that the overall slowdown is minimized. Similar to the marginal miss
utility (used by UCP), we define marginal slowdown utility as the decrease in slowdown per extra
allocated way. Specifically, for an application with a current allocation of n ways, the marginal
slowdown utility of allocating k additional ways is

\[ \text{Slowdown-Utility}_n^{n+k} = \frac{\text{slowdown}_n - \text{slowdown}_{n+k}}{k} \]

Starting from zero ways for each application, the marginal slowdown utility is computed for all
possible way allocations for all applications. The application that has the maximum slowdown
utility for a certain allocation is given that number of ways. This process is repeated until all
ways are allocated. For more details on the partitioning algorithm, please refer to the look-ahead
algorithm presented in [95].
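
The allocation loop described above can be sketched in Python as follows. This is a simplified illustration, not the exact algorithm of [95]: the function name and the `slowdowns` table layout are assumptions, the slowdown at zero ways is assumed to be supplied by the caller, and ties are broken in favor of the first application examined.

```python
def lookahead_partition(slowdowns, total_ways):
    """Greedy look-ahead allocation driven by marginal slowdown utility.

    slowdowns[a][n] is the estimated slowdown of application a with n
    ways, for n = 0 .. total_ways. Returns the ways given to each app.
    """
    num_apps = len(slowdowns)
    alloc = [0] * num_apps
    balance = total_ways  # ways not yet allocated
    while balance > 0:
        best = None  # (utility, application index, extra ways)
        for app in range(num_apps):
            n = alloc[app]
            for k in range(1, balance + 1):
                # Decrease in slowdown per extra allocated way.
                utility = (slowdowns[app][n] - slowdowns[app][n + k]) / k
                if best is None or utility > best[0]:
                    best = (utility, app, k)
        _, app, k = best
        alloc[app] += k
        balance -= k
    return alloc
```

Looking ahead over every k up to the remaining balance, rather than only k = 1, lets the loop see past allocations where one extra way helps little but several extra ways help a lot.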

7.1.2 Evaluation 

Figure 7.1 compares the system performance and fairness of ASM-Cache against a baseline that
employs no cache partitioning (NoPart) and utility-based cache partitioning (UCP) [95], for
different core counts. We simulate 100 workloads for each core count. We use the harmonic speedup
metric to measure system performance and the maximum slowdown metric to measure unfairness.
Three observations are in order. First, ASM-Cache provides significantly better fairness and
comparable/better performance across all core counts, compared to UCP. This is because ASM-Cache
explicitly takes into account application slowdowns in performing cache allocation, whereas UCP
uses miss counts as a proxy for performance. Second, ASM-Cache’s gains increase with increasing
core count: it reduces unfairness by 12.5% on the 8-core system, and reduces unfairness by
15.8% while improving performance by 5.8% on the 16-core system. This is because contention for
cache capacity increases with increasing core count, offering more opportunity for ASM-Cache to
mitigate unfair application slowdowns. Third, we see significant fairness improvements of 12.5%
with a larger (4 MB) cache, on a 16-core system (plots not presented due to space constraints).
We conclude that accurate slowdown estimates from ASM can enable effective cache partitioning
among contending applications, thereby improving fairness and performance.


Figure 7.1: ASM-Cache: Fairness and performance


7.2 ASM Memory Bandwidth Partitioning

In this section, we present ASM Memory Bandwidth Partitioning (ASM-Mem), a scheme to partition
memory bandwidth among applications, based on slowdown estimates from ASM, with the
goal of improving fairness. The basic idea behind ASM-Mem is to allocate bandwidth to each
application proportional to its estimated slowdown, such that applications that have higher
slowdowns are given more bandwidth.


7.2.1 Mechanism 

ASM is used to estimate all applications’ slowdowns at the end of every quantum, Q. These
slowdown estimates are then used to determine the bandwidth allocation of each application.
Specifically, the probability with which an epoch is assigned to an application is proportional to
its estimated slowdown. The higher the slowdown of the application, the higher the probability
that each epoch is assigned to the application. For an application A_i, the probability that an
epoch is assigned to the application is given by

\[ \text{Probability of assigning an epoch to } A_i = \frac{\text{slowdown}(A_i)}{\sum_k \text{slowdown}(A_k)} \]

At the beginning of each epoch, the epoch is assigned to one of the applications based on the 
above probability distribution and requests of the corresponding application are prioritized over 
other requests during that epoch, at the memory controller. 
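
A minimal Python sketch of this epoch assignment follows. It only illustrates the probability distribution; the function names and the dictionary of slowdown estimates are hypothetical, and a real memory controller would implement the draw in hardware.

```python
import random


def epoch_probabilities(slowdowns):
    """Map each application to its probability of being assigned an epoch."""
    total = sum(slowdowns.values())
    return {app: s / total for app, s in slowdowns.items()}


def assign_epoch(slowdowns, rng=random):
    """Pick the application to prioritize in the next epoch.

    The draw is weighted by estimated slowdown, so more slowed-down
    applications are chosen more often.
    """
    apps = list(slowdowns)
    weights = [slowdowns[app] for app in apps]
    return rng.choices(apps, weights=weights, k=1)[0]
```

For instance, with estimated slowdowns of 3.0 for application A and 1.0 for application B, A is assigned each epoch with probability 0.75 and B with probability 0.25.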

7.2.2 Evaluation


We compare ASM-Mem with three previously proposed memory schedulers, FRFCFS, PARBS
and TCM. FRFCFS [97, 129] is an application-unaware scheduler that prioritizes row-buffer hits
(to maximize bandwidth utilization) and older requests (for forward progress). FRFCFS tends to
unfairly slow down applications with low row-buffer locality and low memory intensity. To tackle
this problem, application-aware schedulers such as PARBS [87] and TCM llMI have been proposed
that reorder applications’ requests at the memory controller, based on their access characteristics.

Figure 7.2: ASM-Mem: Fairness and performance


Figure 7.2 shows the fairness and performance of ASM-Mem, FRFCFS, PARBS and TCM, for
three core counts, averaged over 100 workloads for each core count. We draw three major
observations. First, ASM-Mem achieves better fairness than the three previously proposed scheduling
policies, while achieving comparable/better performance. This is because ASM-Mem directly uses
ASM’s slowdown estimates to allocate more bandwidth to highly slowed down applications, while
previous works employ metrics such as memory intensity and row-buffer locality that are proxies
for performance/slowdown. Second, ASM-Mem’s gains increase as the number of cores increases,
achieving 5.5% and 12% improvement in fairness on the 8- and 16-core systems respectively.
Third, we see fairness gains on systems with larger channel counts as well: 6% on a 16-core
2-channel system (plots not presented due to space constraints). We conclude that ASM-Mem is
effective in mitigating interference between applications at the main memory, thereby improving
fairness.

7.2.3 Combining ASM-Cache and ASM-Mem


We combine ASM-Cache and ASM-Mem to build a coordinated cache-memory management
scheme. ASM-Cache-Mem performs cache partitioning using ASM-Cache and conveys the
slowdown estimated by ASM-Cache for each application (corresponding to its cache way allocation) to
the memory controller. The memory controller uses these slowdowns to partition memory
bandwidth across applications using ASM-Mem. Figure 7.3 compares ASM-Cache-Mem with
combinations of FRFCFS, PARBS and TCM with UCP, across 100 16-core workloads with a 4MB shared
cache and 1/2 memory channels. ASM-Cache-Mem improves fairness by 14.6%/8.8% on the 1/2
channel systems respectively, compared to the fairest previous mechanism (FRFCFS+UCP), while
achieving performance within 1% of the highest performing previous combination (PARBS+UCP).
We conclude that ASM-Cache-Mem is effective in mitigating interference at both the shared cache
and main memory, achieving significantly better fairness than combining previously proposed
cache partitioning/memory scheduling schemes.

Figure 7.3: Combining ASM-Cache and ASM-Mem


7.3 Providing Soft Slowdown Guarantees 


In a multicore system, multiple applications are consolidated on the same system. In such systems,
ASM’s slowdown estimates can be leveraged to bound the application slowdowns.

Figure 7.4 shows the slowdowns of four applications in a workload using a naive cache
allocation scheme and a slowdown-aware scheme based on ASM. The goal is to achieve a specified
slowdown bound for the first application, h264ref. The Naive-QoS scheme, which is unaware of
application slowdowns, allocates all cache ways to h264ref, the application of interest.
ASM-QoS, on the other hand, allocates just enough cache ways to the application of interest, h264ref,
such that a specific slowdown bound (indicated by X in ASM-QoS-X) is met. Such a scheme is
enabled by ASM’s ability to estimate slowdowns for all possible cache allocations (Section 7.1).
Naive-QoS minimizes h264ref’s slowdown, enabling it to meet any slowdown bound greater than
2.17. However, it does so at the cost of slowing down other applications significantly.
ASM-QoS, on the other hand, allocates just enough cache ways to h264ref such that it meets the
specified bound, while also reducing slowdowns for the other three applications, mcf, sphinx3 and
soplex, compared to Naive-QoS, thereby improving overall performance significantly (15%/20%
for ASM-QoS-2.5/ASM-QoS-4 over Naive-QoS).

[Figure 7.4 legend: Naive-QoS, ASM-QoS-2.5, ASM-QoS-3, ASM-QoS-3.5, ASM-QoS-4]

Figure 7.4: ASM-QoS: Slowdowns and performance


This is one example policy that leverages ASM’s slowdown estimates to partition the shared 
cache capacity to achieve a specific slowdown bound. More sophisticated schemes can be built 
on top of ASM’s slowdown estimates that control the allocation of both memory bandwidth and 
cache capacity such that different applications’ slowdown bounds are met, while still achieving 
high overall system performance. We propose to explore such schemes as part of future work. 
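
The core decision of the ASM-QoS policy described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and the per-allocation slowdown estimates are assumed to be available as a mapping from way count to estimated slowdown. How the remaining ways are divided among the other applications is not shown.

```python
def ways_for_bound(slowdown_by_ways, bound):
    """Return the smallest way allocation whose estimated slowdown meets
    the bound, or None if no allocation does.

    slowdown_by_ways maps a number of ways n to the estimated slowdown
    of the application of interest when it is given n ways.
    """
    for n in sorted(slowdown_by_ways):
        if slowdown_by_ways[n] <= bound:
            return n
    return None
```

Choosing the smallest allocation that meets the bound leaves the remaining ways for the other applications, which is what distinguishes this approach from giving every way to the application of interest.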

7.4 Fair Pricing in Cloud Systems


Applications from different users could be consolidated onto the same machine in a cloud server
cluster. Pricing schemes in cloud systems bill users based on CPU core, memory/storage capacity
allocation and run length of a job [l4l|2l. However, they do not account for interference at the cache
and main memory. For instance, when two jobs A and B are run together on the same system, job A
runs for three hours due to cache/memory interference from job B, but would have run for only an
hour, had it been run alone. In this scenario, accurate slowdown estimates from ASM can enable
pricing based on how much an application is slowed down due to interference (especially since
profiling every application to get alone run times is not feasible). In the example above, ASM
would estimate job A’s slowdown to be 3x, enabling the user to be billed for only one hour, as
against three hours with a scheme that bills based only on resource allocation and run time.
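
The arithmetic in this example can be written as a short Python sketch. The function name is hypothetical and the sketch assumes billing is simply proportional to the interference-free run time; an actual pricing scheme could use the slowdown estimate differently.

```python
def interference_free_hours(measured_hours, estimated_slowdown):
    """Estimate how long a job would have run alone.

    measured_hours is the run time observed while sharing the machine;
    estimated_slowdown is the slowdown estimate (shared / alone).
    """
    if estimated_slowdown < 1.0:
        raise ValueError("a slowdown estimate is expected to be at least 1")
    return measured_hours / estimated_slowdown
```

For job A above, a measured run time of three hours and an estimated slowdown of 3x give one billable hour.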


7.5 Migration and Admission Control 

ASM’s slowdown estimates could be leveraged by the system software to make migration and
admission control decisions. Previous works monitor different metrics such as cache misses and memory
bandwidth utilization across machines in a cluster and migrate applications across machines
based on these metrics HI 12117^ [9^ II 1811 . While such metrics serve as proxies for interference,
accurate slowdown estimates are a direct measure of the impact of interference on performance.
Hence, periodically communicating slowdown estimates from ASM to the system software could
enable better migration decisions. For instance, the system software could migrate applications
away from machines on which slowdowns are very high or it could perform admission control and
prevent new applications from being scheduled on machines where currently running applications
are experiencing significant slowdowns.

7.6 Summary


We present several use cases/mechanisms that can leverage accurate slowdown estimates from
ASM, towards different goals such as achieving high fairness, system performance and providing
soft slowdown guarantees. Our evaluations show that several of these mechanisms improve
fairness/performance over state-of-the-art schemes, thereby demonstrating the effectiveness of ASM
in enabling higher and more controllable performance. We conclude that ASM is a promising
substrate that can enable the design of effective mechanisms to estimate and control application
slowdowns in modern and future multicore systems.

Chapter 8


Conclusions and Future Directions 

8.1 Conclusions 

In a multicore system, interference between applications at shared resources is a major challenge
and degrades both overall system performance and fairness and individual application
performance. Furthermore, as we showed earlier in this thesis, an application’s performance varies
depending on the co-running applications and the amount of available shared resources in a system.

Several previous works have tackled the problem of memory interference mitigation, with the
goal of achieving high performance, with the prevalent direction being memory request
scheduling. State-of-the-art memory schedulers rank individual applications with a total order based on
their memory access characteristics. Such a total order based ranking scheme increases hardware
complexity significantly, to the point that the scheduler cannot always meet the fast command
scheduling requirements of state-of-the-art DDR protocols. Furthermore, employing a total order
ranking across individual applications also causes unfair application slowdowns, as we
demonstrated in Chapter 3.

We presented the Blacklisting memory scheduler (BLISS) in Chapter 3 that tackles these
shortcomings of previous schedulers and achieves high performance and fairness, while incurring low
hardware complexity. BLISS does so based on two new observations. First, it is sufficient to i)
separate applications into only two groups, one containing applications that are vulnerable to
interference and another containing applications that cause interference, and ii) prioritize requests
of the vulnerable-to-interference group over the requests of the interference-causing group.
Second, we observe that applications can be classified as interference-causing or vulnerable by
simply monitoring the number of consecutive requests served from an application in a short time
interval.

While BLISS is able to achieve high performance, it does not tackle the problem of providing
performance guarantees in the presence of shared resource interference. Specifically, while
BLISS mitigates application slowdowns, it cannot precisely quantify and control application
slowdowns. Towards achieving the goal of quantifying and controlling slowdowns, we presented a
model to accurately estimate application slowdowns in the presence of memory interference in
Chapter 4. The Memory Interference induced Slowdown Estimation (MISE) model estimates
application slowdowns based on the observation that a memory-bound application’s performance is
roughly proportional to the rate at which its memory requests are served. This enables estimating
slowdown as a ratio of request service rates. The alone-request-service-rate of an application can
be estimated by giving its requests highest priority in accessing main memory.

Accurate slowdown estimates from the MISE model can be leveraged to drive various hardware,
software and hardware-software cooperative resource management techniques that strive to achieve
high performance and fairness and to provide performance guarantees. We demonstrate two such use
cases of the MISE model in Chapter 5: one that bounds the slowdown of critical applications in
a workload, while also optimizing for overall system performance, and another that minimizes
slowdowns across all applications in a workload.

The MISE model estimates slowdown accurately in the presence of main memory interference,
but does not take into account contention for shared cache capacity. The Application Slowdown
Model (ASM) that we presented in Chapter 6 takes into account shared cache capacity interference,
in addition to memory bandwidth interference. ASM observes that an application’s performance
is roughly proportional to the rate at which it accesses the shared cache. This observation is more
general than MISE’s observation that holds only for memory-bound applications. ASM exploits
this observation to estimate slowdown as a ratio of cache access rates and estimates
alone-cache-access-rate by minimizing interference at the main memory and quantifying interference at the
shared cache. Slowdown estimates from ASM can enable various resource management
techniques to manage both the shared cache and main memory. We discuss and evaluate several such
techniques in Chapter 7, demonstrating the effectiveness of ASM in estimating and controlling
application slowdowns.

We believe the mechanisms and models we proposed in this thesis could have wide applicability
both in terms of being effective in achieving high and controllable performance and also in terms
of inspiring future research. Furthermore, beyond the models and mechanisms, the key principles
and ideas that we conceive and employ in building our mechanisms could have applicability and
impact in several different contexts. Specifically,

• The principle of using request service/access rate as a proxy for performance is a general
observation that would hold in any closed loop system. This principle can be applied to
manage contention and estimate interference-induced slowdowns in the context of other
resources such as storage and network too, besides at the shared cache and main memory.

• The notion of achieving interference mitigation by simply classifying applications into two
groups, rather than employing a full ordered ranking across all applications, can be applied
in the context of managing contention at other resources too.


8.2 Future Research Directions 

Our models, mechanisms and principles can inspire future research in multiple different directions.
We describe some potential research directions in the next sections.

8.2.1 Leveraging Slowdown Estimates for Cluster Management

Slowdown estimates from our models can be leveraged to drive various resource management
policies. The hardware resource management policies that we presented and evaluated in Chapters 5
and 7 partition resources such as caches and main memory bandwidth on a single node. Especially
in the context of providing soft slowdown guarantees, such a node-level management policy might
not be able to meet the required slowdown bounds/performance requirements. In this scenario,
communicating the slowdown estimates to the system software/hypervisor can enable various
application/virtual machine migration and admission control policies.

We have built one such policy that employs a simple linear model that relates performance of an
application to its memory bandwidth consumption to detect contention and drive virtual machine
migration decisions in our VEE 2015 paper [118] (part of Hui Wang’s PhD thesis). This policy
strives to achieve high performance.

We believe there is ample scope to explore and build several more virtual cluster management
policies that exploit slowdown estimates to achieve various different goals such as meeting
performance guarantees, improving system fairness, etc., both in real systems and in the simulation
realm. For instance, we were relatively constrained by what counters are available in existing
systems when designing our model and virtual machine migration policy. The new Haswell machines
provide more counters and support for monitoring and managing the shared cache capacity. This
would enable more effective cluster management policies and would also enable combining cluster
management policies with resource allocation policies.

8.2.2 Performance Guarantees in Heterogeneous Systems 

While the focus of this thesis was on providing soft performance guarantees in the context of
homogeneous multicore systems, meeting different kinds of performance requirements in heterogeneous
systems with different kinds of agents is an important research problem. We have explored this
problem in the context of SoC systems with different agents such as CPU cores, GPUs and
hardware accelerators uni. Our goal in this work is to meet the deadlines/frame rate requirements of
hardware accelerators and GPUs, while still achieving high CPU performance.

We believe there are several interesting and unsolved problems and challenges in this space.
For instance, different agents could have different kinds of performance requirements in a
heterogeneous system. Some agents might need to meet deadlines, whereas other agents might have
requirements on resources such as memory bandwidth, while agents such as CPU cores might have
slowdown/latency requirements.

The design of a memory system and a memory controller that is able to take into account the
different, and often conflicting, requirements of different agents and applications is a significant
challenge. Furthermore, building a memory system that can take in such requirements and strive
to meet them in a general manner, without being limited by the specific configuration of the agents
and the system, is an even bigger challenge. We believe this is a rich area with ample scope for
future exploration.

8.2.3 Integration of Memory Interference Mitigation Techniques 

The main focus of this thesis was on mitigating and quantifying memory interference with the goal
of building better resource management and allocation techniques. This thesis focused heavily
on memory request scheduling to perform slowdown estimation and resource management (e.g.,
BLISS, MISE and the mechanisms built on top of MISE). However, as mentioned in Section 2,
there are several other approaches to mitigate memory interference. For instance, we have explored
memory channel partitioning [84]. Other previous works have proposed bank partitioning IfSOl |7TJ
11221, interleaving [|5^, and source throttling [|T71I^I^|901.

These different approaches could be effectively combined together rather than relying on one
approach, to address the memory interference problem, as we discuss in our papers describing the
challenges in the main memory system [85, 88]. For instance, channel partitioning maps
applications’ data onto different channels depending on their access patterns. In this context, the memory
request scheduling policy could be tailored to better match the access characteristics of the specific
applications that are mapped to the channel. We briefly explored this idea in our work on memory
channel partitioning [84] (part of Sai Prashanth Muralidhara’s thesis). However, there is plenty
of scope to explore this further. In fact, this could lead to the notion of programmable memory
controllers, where the memory request scheduling policy can be tuned at run time, depending on
the workload.

The address interleaving policy heavily influences the row-buffer locality and bank-level
parallelism of different applications’ accesses. Hence, co-design of the address interleaving policy along
with the memory scheduling policy can enable a memory controller design that is more amenable
to the access characteristics of different applications, given a specific address interleaving policy.

We expand more on some of these challenges in [|Ml US. We classify resource management
techniques into dumb vs. smart resource techniques. Dumb resources do not have the intelligence
to manage themselves and rely on a centralized agent to manage and allocate them. Smart
resources have the intelligence to manage and control their own allocation. We discuss the trade-offs
involved in the design and effectiveness of these different kinds of techniques and the challenges
in combining them effectively.

The interactions between these different memory interference mitigation techniques offer a wide
range of different choices and opportunities to design a memory system that leverages these
different degrees of freedom in a synergistic manner. Hence, we believe this is an important and
promising direction for future exploration.

8.2.4 Resource Management for Multithreaded Applications 

This thesis focused predominantly on estimating and managing contention between multiple
single-threaded applications when they are run together on a system. The problem of managing contention
between different threads in a multithreaded application and between multiple multithreaded
applications is an important challenge.

Multiple threads in a multithreaded application work towards achieving a common goal. This
is a different scenario than when multiple competing single-threaded applications, each with a
different goal, are run together on a system. Although the different threads work towards the same
goal, it is still important to apportion resources accordingly among the different threads such that
the threads that lag the most are given more resources.

There are several different unaddressed research challenges in this space. For instance,
accurately estimating how much each thread is slowed down is an important aspect of determining
the amount of progress made by each thread. Once accurate metrics are developed to capture the
progress of multithreaded applications, they can be leveraged to build resource allocation policies
that partition resources among different threads of a multithreaded application. Furthermore,
managing resources between multiple multithreaded applications is yet another important and
promising research area that has not been explored much.

8.2.5 Coordinated Management of Main Memory and Storage 

The primary focus of this thesis was on managing main memory and shared caches. However, the
storage system is an important component that needs to be taken into account in order to build
a comprehensive resource management substrate. While there has been a large body of work on
storage QoS, as we describe in Section 2, the interactions between memory bandwidth, memory
capacity and storage bandwidth have not been explored much. Our ideas on coordinated shared
cache capacity and memory bandwidth management could potentially be leveraged in the context
of main memory capacity and storage bandwidth. Furthermore, our observations on request service
rate correlating linearly with performance could be leveraged to estimate progress and performance
in the context of other resources such as the storage bandwidth.

In the past decade, there has been a proliferation of storage class non-volatile memory
technologies such as flash and phase change memory. In this context, the notion of how long storage
accesses take changes. An application might potentially not need to be context switched on a page
fault. Furthermore, such fast non-volatile memory technologies provide the opportunity to manage
the DRAM main memory and the NVM storage system as a single address space, as described by
Meza et al. in [TTSlI . Coordinated management of main memory and such fast storage technologies,
in light of these advancements, presents new and rich opportunities. Furthermore, the coordinated
management of main memory and storage also opens up opportunities for more hardware-software
cooperative solutions, since both the hardware and software layers need to be involved for
effective management of the main memory and storage in a coordinated manner. We believe there are
several very important and intriguing challenges in this space.

8.2.6 Comprehensive Slowdown Estimation 

Estimating slowdown due to contention at all resources in a system enables understanding the
impact of contention at all resources in a comprehensive manner and consequently, the management
of these different resources. The previous section on expanding our work to include the storage
system was in this spirit. However, other resources such as the on-chip interconnect and the off-chip
network should be taken into account to build a comprehensive slowdown estimation model.

Our principles on request service rate correlating with performance can potentially be used to
estimate slowdown due to different shared resources. However, access characteristics and
bottleneck behavior at these different resources could be different, providing ample scope for new
insights and ideas in this space. Furthermore, once slowdown estimates are available, managing
a large set of resources and, specifically, doing so in a coordinated manner are important and
challenging problems.